Cell-free DNA methylation patterns for disease and condition analysis

ABSTRACT

Disclosed herein are methods and systems of utilizing sequencing reads for detecting and quantifying the presence of a tissue type or a disease type in cell-free DNA prepared from blood samples.

PRIORITY

This application is a continuation application of U.S. patent application Ser. No. 16/307,821 filed Dec. 6, 2018, which is a national phase application under 35 U.S.C. § 371 of International Patent Application PCT/IB2017/053378 filed on Jun. 7, 2017, which claims priority to U.S. Provisional Patent Application 62/347,010 filed on Jun. 7, 2016; U.S. Provisional Patent Application 62/473,829 filed on Mar. 20, 2017; and U.S. Provisional Patent Application 62/491,560 filed on Apr. 28, 2017, all of which are hereby incorporated by reference in their entireties.

STATEMENT REGARDING FEDERALLY-SPONSORED RESEARCH

This invention was made with government support under Grant No. HL108634 awarded by the National Institutes of Health (NIH). The government has certain rights in the invention.

FIELD OF THE INVENTION

The invention disclosed herein generally relates to methods for analyzing sequencing data of nucleic acid samples (e.g., cell-free DNA samples). It also relates to methods of cancer diagnosis and prognosis, including the identification, origin, and location of a cancer.

BACKGROUND

Unlike traditional biopsies involving invasive surgery, liquid biopsies utilize only blood samples obtained with minimal invasiveness. Blood is the only biological material that has contact with almost all human organs (including tumors and inflammatory tissues) through the human circulation system. Therefore, the blood carries a large amount of valuable information and disease signs with regard to the status of many organs. For example, in the blood plasma, cell-free circulating DNAs (abbreviated as cfDNAs), the degraded DNA fragments released from apoptotic or necrotic cells in many organs, are regarded as a mixture of DNAs from many normal tissue cells and diseased cells (e.g., cancerous tumor cells). Therefore, they are one of the best sources for blood-based cancer diagnosis and have recently attracted major interest for that purpose.

However, DNA fragments from diseased cells often constitute only a small portion of the cfDNA samples, especially at an early stage of the disease. As such, sequencing information representing diseased DNA is often drowned out by sequencing information representing normal DNA. What is needed are methods and/or systems for selectively and sensitively deciphering sequencing information relating to diseased DNA.

Early detection and identification of cancer are highly desired. Traditionally, identification of cancer involves invasive tissue biopsy procedures. There exists no method or apparatus that provides accurate screening and identification of the tissue-of-origin of a cancer with a non-invasive method when the cancer is in an early stage.

Early detection of cancer, before it has had a chance to metastasize, presents the best strategy for increasing cancer survival. Recently, cancer detection using cell-free DNA (cfDNA) from blood has attracted significant interest due to its non-invasive nature. However, tumor cfDNA levels are very low in most early-stage and many advanced-stage cancer patients (Bettegowda et al., 2014; Newman et al., 2014). Therefore, the major challenge in cfDNA-based early cancer diagnostics is how to identify the tiny amount of tumor cfDNAs out of total cfDNAs in blood. The mainstream approach to address this challenge is mutation-based, i.e., using targeted deep sequencing (>5,000× coverage), combined with error-suppression techniques, to call cfDNA mutations in a small gene panel (Bettegowda et al., 2014; Newman et al., 2014; Newman et al., 2016). While this approach provides a sensitive way to monitor cancer recurrence when the mutations are known, a small gene panel cannot serve diagnostic purposes because mutations can be widespread and very heterogeneous, even in the same type of cancer (Burrell et al., 2013; Tumer et al., 2012; Greenman et al., 2007; Schmitt et al., 2012). However, enlarging the gene panel, while maintaining the sequencing depth, is cost-prohibitive. Therefore, there remains the challenge of detecting the trace amount of tumor cfDNA using a different approach, namely, using the cfDNA methylation patterns.

This disclosure describes different embodiments of machines, apparatuses, computer products, and methods for screening cancers and identifying the tissue-of-origin of cancer cells using blood samples drawn from patients when the cancer is in an early stage.

SUMMARY OF THE INVENTION

In one aspect, provided herein is a method of characterizing a cell-free DNA (cfDNA) sample from a subject. In some embodiments, the method comprises the steps of receiving a plurality of sequencing reads for a cfDNA sample from the subject, wherein each sequencing read comprises methylation sequencing data obtained from a consecutive nucleic acid sequence of 50 or more nucleic acids; calculating a methylation pattern based on a sequencing read in the plurality, wherein the methylation pattern comprises a genomic region corresponding to the consecutive nucleic acid sequence and methylation status of one or more motifs in the genomic region; comparing the methylation pattern with each of one or more pre-established methylation signatures to compute one or more likelihood scores, wherein each of the one or more pre-established methylation signatures correlates with a biological composition, and wherein each pre-established methylation signature comprises at least one pre-determined signature region and pre-determined methylation rate associated therewith; and characterizing the sequencing read as containing the biological composition if at least one of the one or more likelihood scores exceeds a threshold value.
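The per-read comparison described above can be sketched as follows. This is a minimal illustration only, under stated assumptions: each signature is reduced to a single methylation rate over its signature region, CpG sites on a read are treated as independent Bernoulli draws, and the likelihood score is taken to be a log-likelihood ratio against a background rate. The names `Signature`, `read_log_likelihood`, and `classify_read`, as well as the background rate and threshold values, are hypothetical and are not taken from the disclosure.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Signature:
    """Hypothetical pre-established signature: one region, one methylation rate."""
    name: str      # biological composition the signature correlates with
    chrom: str
    start: int     # 0-based, inclusive
    end: int       # exclusive
    rate: float    # pre-determined methylation rate for the region


def read_log_likelihood(cpg_states, rate, eps=1e-6):
    """Log-likelihood of a read's CpG states (1 = methylated) at a given rate.

    Assumes independent CpG sites; the rate is clipped to avoid log(0).
    """
    rate = min(max(rate, eps), 1.0 - eps)
    return sum(math.log(rate) if s else math.log(1.0 - rate) for s in cpg_states)


def classify_read(chrom, pos, cpg_states, signatures, background_rate, threshold):
    """Return the names of signatures whose score exceeds the threshold."""
    background = read_log_likelihood(cpg_states, background_rate)
    hits = []
    for sig in signatures:
        # Only compare against signatures whose region contains the read.
        if sig.chrom != chrom or not (sig.start <= pos < sig.end):
            continue
        score = read_log_likelihood(cpg_states, sig.rate) - background
        if score > threshold:
            hits.append(sig.name)
    return hits
```

For instance, a read with four methylated CpG sites inside a region whose signature rate is 0.9 scores well above a background rate of 0.1, whereas a fully unmethylated read in the same region does not.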

In some embodiments, the method further comprises a step of repeating the comparing and characterizing steps for each sequencing read in the plurality.

In some embodiments, the method further comprises a step of establishing the one or more pre-established methylation signatures based on existing methylation sequencing data (e.g., both array-based and sequencing data).

In some embodiments, the method further comprises a step of determining a level of the biological composition in the cfDNA sample based on the number of sequencing reads containing the biological composition in the plurality of sequencing reads.
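One simple way to turn per-read characterizations into a sample-level quantity is sketched below. It assumes, purely for illustration, that the level is reported as the fraction of reads characterized as containing each biological composition; the function name `composition_levels` and the input layout (one list of labels per read) are hypothetical.

```python
from collections import Counter


def composition_levels(read_labels):
    """Fraction of reads characterized as containing each composition.

    read_labels: one entry per sequencing read, each a list of composition
    names assigned to that read (an empty list if none was assigned).
    """
    total = len(read_labels)
    if total == 0:
        return {}
    # Count each composition at most once per read.
    counts = Counter(name for labels in read_labels for name in set(labels))
    return {name: n / total for name, n in counts.items()}
```

With four reads labeled `["liver"]`, `[]`, `["liver", "lung"]`, and `[]`, this yields a liver level of 0.5 and a lung level of 0.25.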

In some embodiments, the existing methylation sequencing data is selected from the group consisting of tissue specific sequencing data, disease specific sequencing data, individual sequencing data, population sequencing data, and combinations thereof.

In some embodiments, the cfDNA sample is prepared from a plasma or blood sample from the subject. The biological sample may be any biological liquid such as saliva, amniotic fluid, cystic fluid, spinal or brain fluid, urine, sweat, or tears. It may contain contaminating amounts of cells, such as cells in an amount that is at most or less than about 1, 10, 100, 1000, or 10000 intact cells (on average) per microliter of liquid (or any range derivable therein).

In some embodiments, the biological composition is selected from the group consisting of diseased tissue, cancer tissue, tissue from a specific organ, liver tissue, lung tissue, kidney tissue, colon tissue, T-cells, B-cells, neutrophils, small intestines tissue, pancreas tissue, adrenal glands tissue, esophagus tissue, adipose tissue, heart tissue, brain tissue, placenta tissue, and combinations thereof. Methods, computer programs, and apparatuses described herein can be applied to any disease or condition in which a difference exists in the methylation pattern of cell-free DNA from affected versus unaffected individuals or individuals at a different stage of the disease or condition or having a different prognosis. For example, one can identify the abnormal methylation patterns of cell-free DNAs from oligodendrocytes to diagnose multiple sclerosis, from pancreatic β-cells to diagnose type 1 diabetes, and from pancreatic cells to diagnose pancreatitis, based on data used to derive the methylation signatures of these diseases. Therefore, in some embodiments, obtaining or generating a methylation profile of cell-free DNA from biological samples with the disease is included. In other embodiments, obtaining or generating a methylation profile of cell-free DNA from biological samples without the disease or considered disease-free is included.

In some embodiments, the cancer tissue is selected from a group consisting of liver cancer tissue, lung cancer tissue, kidney cancer tissue, colon cancer tissue, brain cancer tissue, pancreas cancer tissue, gastrointestinal cancer tissue, head and neck cancer tissue, bone cancer tissue, tongue cancer tissue, gum cancer tissue, and combinations thereof. In other embodiments, the tissue is selected from a group consisting of liver tissue, brain tissue, lung tissue, kidney tissue, colon tissue, pancreas tissue, gastrointestinal tissue, head and neck tissue, bone, tongue tissue, gum tissue, and combinations thereof.

In some embodiments, the methylation status and pre-determined methylation status are determined at bin level.

In some embodiments, the methylation status and pre-determined methylation status are determined at CpG site level.

In some embodiments, each of the one or more motifs is a CpG site.

In some embodiments, the method further comprises comparing the level of the biological composition in the cfDNA of the subject to that of the biological composition in a normal subject or a known cancer patient, or a patient known to be affected or afflicted with a particular disease or condition.

In some embodiments, the level of the biological composition in the normal subject or known cancer or other disease patient has been previously determined using the same method or a different method.

In one aspect, provided herein is a method for comparing the level of a biological composition from a normal subject to that of the same biological composition from a potential patient. Here, the method as disclosed herein can be used to determine the level of the biological composition using cfDNA samples from both the normal subject and a potential patient.

In one aspect, provided herein is a method for comparing the level of a biological composition from a known cancer patient to that of the same biological composition from a potential patient. In another aspect, provided herein is a method for comparing the level of a biological composition from a patient with a known disease or condition to that of the same biological composition from a potential patient. Here, the method as disclosed herein can be used to determine the level of the biological composition using cfDNA samples from both the normal subject and a potential patient.

It would be understood that cfDNAs from a known patient of any disease can be used as a standard for disease diagnosis. It is contemplated that any embodiment discussed herein with respect to cancer can be implemented with respect to any other disease or condition in which methylation profiles in cfDNA differ between normal or non-diseased individuals and diseased individuals.

In one aspect, provided herein is a method of comparing the level of a biological composition in a cell-free DNA (cfDNA) sample from an unknown subject to that of the same biological composition in a normal subject or a known cancer patient. The method comprises the steps of: receiving a first plurality of sequencing reads for a cfDNA sample from the unknown subject, wherein each sequencing read comprises methylation sequencing data obtained from a consecutive nucleic acid sequence of 50 or more nucleic acids; i) calculating a methylation pattern based on a sequencing read in the first plurality, wherein the methylation pattern comprises a genomic region corresponding to the consecutive nucleic acid sequence and methylation status of one or more motifs in the genomic region; ii) comparing the methylation pattern with each of one or more pre-established methylation signatures to compute one or more likelihood scores, wherein each of the one or more pre-established methylation signatures correlates with a biological composition, and wherein each pre-established methylation signature comprises at least one pre-determined signature region and pre-determined methylation rate associated therewith; iii) characterizing the sequencing read as containing the biological composition if at least one of the one or more likelihood scores exceeds a threshold value; iv) repeating the calculating, comparing and characterizing steps for each sequencing read in the first plurality; v) determining a first level of the biological composition in the cfDNA sample from the unknown subject based on the number of sequencing reads containing the biological composition in the first plurality of sequencing reads; receiving a second plurality of sequencing reads for a cfDNA sample from the normal subject or known cancer patient, wherein each sequencing read comprises methylation sequencing data obtained from a consecutive nucleic acid sequence of 50 or more nucleic acids; determining a second level of the biological composition in the cfDNA sample from the patient by carrying out steps i) through v) on the cfDNA sample from the normal subject or known cancer patient; and comparing the first level and second level of the biological composition.

In one aspect, provided herein is a method of detecting changes in composition of a cell-free DNA (cfDNA) sample from a patient. The method comprises the steps of receiving, at a first time point, a first plurality of sequencing reads for a first cfDNA sample from the patient, wherein each sequencing read in the first plurality comprises methylation sequencing data obtained from a first consecutive nucleic acid sequence of 50 or more nucleic acids; i) calculating a first methylation pattern based on a sequencing read in the first plurality, wherein the first methylation pattern comprises a first genomic region corresponding to the first consecutive nucleic acid sequence and methylation status of one or more motifs in the first genomic region; ii) comparing the first methylation pattern with each of one or more pre-established methylation signatures to compute one or more first likelihood scores, wherein each of the one or more pre-established methylation signatures correlates with a biological composition, and wherein each pre-established methylation signature comprises at least one pre-determined signature region and pre-determined methylation rate associated therewith; iii) characterizing the cfDNA as containing the biological composition if at least one of the one or more first likelihood scores exceeds a threshold value; iv) repeating steps i) to iii) for each sequencing read in the first plurality of sequencing reads to quantitate the presence of the biological composition in the cfDNA sample at the first time point; receiving, at a second time point, a second plurality of sequencing reads for a second cfDNA sample from the same patient, wherein each sequencing read in the second plurality comprises methylation sequencing data obtained from a second consecutive nucleic acid sequence of 50 or more nucleic acids; repeating steps i) to iv) for each sequencing read in the second plurality of sequencing reads to quantitate the presence of the biological composition in the cfDNA sample at the second time point; and detecting a change in the biological composition between the first and second time points.

In some embodiments, the biological composition is selected from the group consisting of diseased tissue, cancer tissue, tissue from a specific organ, liver tissue, lung tissue, kidney tissue, colon tissue, T-cells, B-cells, neutrophils, small intestines tissue, pancreas tissue, adrenal glands tissue, esophagus tissue, adipose tissue, heart tissue, brain tissue, placenta tissue, and combinations thereof. In some embodiments, the patient may be determined to have a disease or condition or be determined specifically not to have the disease or condition.

In particular, cancer cells can often display aberrant DNA methylation patterns, such as hypermethylation of the promoter regions of tumor suppressor genes and pervasive hypomethylation of intergenic regions. Therefore, a patient's DNA methylation profile can be a target for cancer evaluation in clinical practice. Hyper/hypo-methylated tumor DNA fragments can be released into the bloodstream via cell apoptosis or necrosis, where these circulating tumor DNA (ctDNA) fragments become part of the circulating cell-free DNA (cfDNA) in plasma. Because it is non-invasive, cfDNA methylation profiling may be an effective strategy for general cancer screening.

In developing embodiments for non-invasive cancer screening and identifying the tumor tissue-of-origin, detection and characterization of cell-free DNA in plasma can be an effective method. Liquid biopsy, e.g., blood draw, unlike traditional tissue biopsy, has the potential to diagnose a variety of different malignancies.

The disclosure herein provides, in some embodiments, a probabilistic method that evaluates a patient for cancer using cell-free DNA (cfDNA), including identifying the location of the cancer and/or tumors. The embodiment simultaneously identifies the proportions and the tissue-of-origin of tumor-derived cell-free DNA in a blood sample using genome-wide DNA methylation data. The disclosure evaluates the embodiments with simulations and real data, and compares their performance. This disclosure shows that the predicted tumor burdens are highly consistent with the true values. Notably, the embodiments disclosed herein achieve accurate results on patient plasma samples, despite the fact that the DNA methylation data from these samples has very low sequencing coverage. Such ability to accurately identify the existence as well as the location of tumors is highly desirable in cancer therapy.

According to one embodiment, a computer program product includes a non-transitory computer-readable medium having instructions configured for cancer detection and tissue-of-origin identification, which, when executed by a processor of a computing system, cause the processor to perform the steps of: receiving an instruction to access data of a cell free DNA (cfDNA) methylation profile of a patient stored in the non-transitory computer-readable medium; identifying a plurality of CpG cluster features in the cfDNA methylation profile wherein a total number of the plurality of CpG cluster features is K, K is a positive integer; determining a circulating tumor DNA (ctDNA) burden coefficient θ, where 0≤θ≤1; determining a potential cancer type t; estimating a methylation level x_(k) for each of the CpG cluster features, where k=1, 2, . . . K; calculating a prediction score λ using θ, t, and x_(k); determining the patient is cancerous having the potential cancer type t, if λ is greater than a predetermined threshold; and determining the patient is noncancerous, if λ is smaller than the predetermined threshold.

According to another embodiment, an apparatus configured for cancer detection and tissue-of-origin identification includes a non-transitory memory and a processor coupled to the non-transitory memory, the processor configured to execute steps of: accessing data of a cell free DNA (cfDNA) methylation profile of a patient stored in the non-transitory memory; identifying a plurality of CpG cluster features in the cfDNA methylation profile wherein a total number of the plurality of CpG cluster features is K, K is a positive integer; determining a circulating tumor DNA (ctDNA) burden coefficient θ, where 0≤θ≤1; determining a potential cancer type t; estimating a methylation level x_(k) for each of the CpG cluster features, where k=1, 2, . . . K; calculating a prediction score λ using θ, t, and x_(k); determining the patient is cancerous having the potential cancer type t, if λ is greater than a predetermined threshold; and determining the patient is noncancerous, if λ is smaller than the predetermined threshold. As discussed above, this and other embodiments discussed herein may be applied to a disease other than cancer.

According to yet another embodiment, a method for cancer detection and tissue-of-origin identification executed by a computer system includes receiving, by a processor of the computer system, an instruction to access data of a cell free DNA (cfDNA) methylation profile of a patient stored in a non-transitory computer-readable medium, wherein the non-transitory computer-readable medium is in communication with the processor; identifying, by the processor, a plurality of CpG cluster features in the cfDNA methylation profile wherein a total number of the plurality of CpG cluster features is K, K is a positive integer; determining, by the processor, a circulating tumor DNA (ctDNA) burden coefficient θ, where 0≤θ≤1; determining, by the processor, a potential cancer type t; estimating, by the processor, a methylation level x_(k) for each of the CpG cluster features, where k=1, 2, . . . K; calculating, by the processor, a prediction score λ using θ, t, and x_(k); determining, by the processor, the patient is cancerous having the potential cancer type t, if λ is greater than a predetermined threshold; and determining, by the processor, the patient is noncancerous, if λ is smaller than the predetermined threshold.

In some embodiments, a best solution of the burden θ for P(R|θ, M) is found through a grid search. In some embodiments, the grid search can also be performed using a higher-resolution step (0.01%), such as 0.010%, 0.020%, 0.025%, 0.030%, 0.040%, 0.050%, 0.060%, 0.070%, 0.080%, 0.090%, 0.100%, 0.125%, 0.150%, 0.175%, 0.200%, 0.225%, 0.250%, 0.275%, 0.300%, 0.325%, 0.350%, 0.375%, 0.400%, 0.425%, 0.450%, 0.475%, 0.500%, 0.525%, 0.550%, 0.575%, 0.600%, 0.625%, 0.650%, 0.675%, 0.700%, 0.725%, 0.750%, 0.775%, 0.800%, 0.825%, 0.850%, 0.875%, 0.900%, 0.925%, 0.950%, 0.975%, 1.0%, 1.1%, 1.2%, 1.3%, 1.4%, 1.5%, 1.6%, 1.7%, 1.8%, 1.9%, 2.0%, 2.1%, 2.2%, 2.3%, 2.4%, 2.5%, 2.6%, 2.7%, 2.8%, 2.9%, 3.0%, 3.1%, 3.2%, 3.3%, 3.4%, 3.5%, 3.6%, 3.7%, 3.8%, 3.9%, 4.0%, 4.1%, 4.2%, 4.3%, 4.4%, 4.5%, 4.6%, 4.7%, 4.8%, 4.9%, 5.0%, 5.1%, 5.2%, 5.3%, 5.4%, 5.5%, 5.6%, 5.7%, 5.8%, 5.9%, 6.0%, 6.1%, 6.2%, 6.3%, 6.4%, 6.5%, 6.6%, 6.7%, 6.8%, 6.9%, 7.0%, 7.1%, 7.2%, 7.3%, 7.4%, 7.5%, 7.6%, 7.7%, 7.8%, 7.9%, 8.0%, 8.1%, 8.2%, 8.3%, 8.4%, 8.5%, 8.6%, 8.7%, 8.8%, 8.9%, 9.0%, 9.1%, 9.2%, 9.3%, 9.4%, 9.5%, 9.6%, 9.7%, 9.8%, 9.9%, 10%, 11%, 12%, 13%, 14%, 15%, 16%, 17%, 18%, 19%, 20%, 21%, 22%, 23%, 24%, 25%, 26%, 27%, 28%, 29%, 30%, 35%, 40%, 45%, 50%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 95%, 999%, and/or 100%, or any range derivable therein. In some embodiments, the grid search can be performed using a higher-resolution step (0.001%), such as 0.001%, 0.002%, 0.003%, 0.004%, 0.005%, 0.006%, 0.007%, 0.008%, 0.009%, 0.010%, 0.020%, 0.025%, 0.030%, 0.040%, 0.050%, 0.060%, 0.070%, 0.080%, 0.090%, 0.100%, 0.125%, 0.150%, 0.175%, 0.200%, 0.225%, 0.250%, 0.275%, 0.300%, 0.325%, 0.350%, 0.375%, 0.400%, 0.425%, 0.450%, 0.475%, 0.500%, 0.525%, 0.550%, 0.575%, 0.600%, 0.625%, 0.650%, 0.675%, 0.700%, 0.725%, 0.750%, 0.775%, 0.800%, 0.825%, 0.850%, 0.875%, 0.900%, 0.925%, 0.950%, 0.975%, 1.0%, 1.1%, 1.2%, 1.3%, 1.4%, 1.5%, 1.6%, 1.7%, 1.8%, 1.9%, 2.0%, 2.1%, 2.2%, 2.3%, 2.4%, 2.5%, 2.6%, 2.7%, 2.8%, 2.9%, 3.0%, 3.1%, 3.2%, 3.3%, 3.4%, 3.5%, 3.6%, 3.7%, 3.8%, 3.9%, 4.0%, 4.1%, 4.2%, 4.3%, 4.4%, 4.5%, 4.6%, 4.7%, 4.8%, 4.9%, 5.0%, 5.1%, 5.2%, 5.3%, 5.4%, 5.5%, 5.6%, 5.7%, 5.8%, 5.9%, 6.0%, 6.1%, 6.2%, 6.3%, 6.4%, 6.5%, 6.6%, 6.7%, 6.8%, 6.9%, 7.0%, 7.1%, 7.2%, 7.3%, 7.4%, 7.5%, 7.6%, 7.7%, 7.8%, 7.9%, 8.0%, 8.1%, 8.2%, 8.3%, 8.4%, 8.5%, 8.6%, 8.7%, 8.8%, 8.9%, 9.0%, 9.1%, 9.2%, 9.3%, 9.4%, 9.5%, 9.6%, 9.7%, 9.8%, 9.9%, 10%, 11%, 12%, 13%, 14%, 15%, 16%, 17%, 18%, 19%, 20%, 21%, 22%, 23%, 24%, 25%, 26%, 27%, 28%, 29%, 30%, 35%, 40%, 45%, 50%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 95%, 999%, and/or 100%, or any range derivable therein.
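A grid search over the burden θ can be sketched as below. This is a hedged illustration, not the disclosed implementation: it assumes that P(R|θ, M) factorizes over reads as a two-component mixture, with each read contributing θ·P(read | tumor class) + (1−θ)·P(read | normal class), and that the per-read class likelihoods have already been computed. The function names `log_likelihood` and `grid_search_theta` and the uniform grid are assumptions; the default step of 0.0001 corresponds to the 0.01% resolution mentioned above.

```python
import math


def log_likelihood(theta, p_tumor, p_normal):
    """log P(R | theta, M) under an assumed two-component read mixture.

    p_tumor[i] and p_normal[i] are the likelihoods of read i under the
    tumor-class and normal-class methylation patterns, respectively.
    """
    total = 0.0
    for pt, pn in zip(p_tumor, p_normal):
        mix = theta * pt + (1.0 - theta) * pn
        total += math.log(max(mix, 1e-300))  # guard against log(0)
    return total


def grid_search_theta(p_tumor, p_normal, step=0.0001):
    """Return (theta, log-likelihood) maximizing the likelihood on a uniform grid."""
    n_steps = round(1.0 / step)
    best_theta, best_ll = 0.0, float("-inf")
    for i in range(n_steps + 1):
        theta = min(i * step, 1.0)
        ll = log_likelihood(theta, p_tumor, p_normal)
        if ll > best_ll:  # ties keep the smaller theta
            best_theta, best_ll = theta, ll
    return best_theta, best_ll
```

A uniform grid is the simplest choice; a non-uniform grid that is denser at low burdens, like the value lists above, could be substituted by iterating over an explicit list of candidate θ values instead.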

Embodiments concern a patient who has symptoms of cancer, is asymptomatic of cancer, has a family or patient history of cancer, is at risk for cancer, or who has been diagnosed with cancer. A patient may be a mammalian patient though in most embodiments the patient is a human. The cancer may be malignant, benign, metastatic, or a precancer. It may be Stage I, II, III, or IV. It may be recurrent and/or chemo- or radiation-resistant. In still further embodiments, the cancer is melanoma, non-small cell lung, small-cell lung, lung, hepatocarcinoma, retinoblastoma, astrocytoma, glioblastoma, gum, tongue, leukemia, neuroblastoma, head, neck, breast, pancreatic, prostate, renal, bone, testicular, ovarian, mesothelioma, cervical, gastrointestinal, lymphoma, brain, colon, sarcoma or bladder. The cancer may include a tumor comprised of tumor cells. Embodiments concern a patient who has symptoms of a particular disease or condition, is asymptomatic of a particular disease or condition, has a family or patient history of a disease or condition, is at risk for a disease or condition, or who has been diagnosed with the disease or condition.

In some embodiments, there are methods for treating cancer in a cancer patient comprising administering to the patient an effective amount of chemotherapy, radiation therapy, or immunotherapy (or a combination thereof) after the patient has been determined to have cancer based on methods disclosed herein. The point of origin of the cancer may be determined, in which case, the treatment is tailored to cancer of that origin. In some embodiments, tumor resection is performed as the treatment or may be part of the treatment with one of the other treatments. Examples of chemotherapeutics include, but are not limited to, the following: alkylating agents such as bifunctional alkylators (for example, cyclophosphamide, mechlorethamine, chlorambucil, melphalan) or monofunctional alkylators (for example, dacarbazine (DTIC), nitrosoureas, temozolomide (oral dacarbazine)); anthracyclines (for example, daunorubicin, doxorubicin, epirubicin, idarubicin, mitoxantrone, valrubicin); taxanes, which disrupt the cytoskeleton (for example, paclitaxel, docetaxel, abraxane, taxotere); epothilones; histone deacetylase inhibitors (for example, vorinostat, romidepsin); topoisomerase I inhibitors (for example, irinotecan, topotecan); topoisomerase II inhibitors (for example, etoposide, teniposide, tafluposide); kinase inhibitors (for example, bortezomib, erlotinib, gefitinib, imatinib, vemurafenib, vismodegib); nucleotide analogs and nucleotide precursor analogs (for example, azacitidine, azathioprine, capecitabine, cytarabine, doxifluridine, fluorouracil, gemcitabine, hydroxyurea, mercaptopurine, methotrexate, tioguanine (formerly thioguanine)); peptide antibiotics (for example, bleomycin, actinomycin); platinum-based antineoplastics (for example, carboplatin, cisplatin, oxaliplatin); retinoids (for example, tretinoin, alitretinoin, bexarotene); and vinca alkaloids (for example, vinblastine, vincristine, vindesine, vinorelbine). Immunotherapies include, but are not limited to, cellular therapy such as dendritic cell therapy (for example, involving chimeric antigen receptor); antibody therapy (for example, Alemtuzumab, Atezolizumab, Ipilimumab, Nivolumab, Ofatumumab, Pembrolizumab, Rituximab or other antibodies with the same target as one of these antibodies, such as CTLA-4, PD-1, PD-L1, or other checkpoint inhibitors); and cytokine therapy (for example, interferon or interleukin).

In certain embodiments, there are methods of diagnosing a patient based on determining whether the patient has a methylation profile indicative of cancer or another disease or condition. In some embodiments, methods involve generating a methylation profile that indicates whether the patient has cancer or another disease or condition, and if so, from what organ. In certain embodiments, this is done using a biological sample from the patient that comprises cell free DNA.

Methods may further involve performing a biopsy, doing a CAT scan, doing a mammogram, performing ultrasound, or otherwise evaluating tissue suspected of being cancerous after determining the patient's methylation profile. In some embodiments, cancer that is found is classified in a cancer classification.

Cell free DNA (cfDNA) in plasma is a good target for detecting cancer, though it is not limited to detecting cancer. Plasma cfDNA includes DNA from both healthy cells and tumor cells. In general, in plasma cfDNA, the portion derived from tumor cells is much smaller than the portion derived from healthy cells. Thus, the challenge of using plasma cfDNA to detect cancer is how to accurately detect the very low amount of cfDNA derived from tumor cells.

Traditional DNA methylation analysis focuses on the methylation rate of an individual CpG site in a cell population. This rate, often called the β-value, is the proportion of cells in which the CpG site is methylated. However, such population-average measures are not sensitive enough to capture an abnormal methylation signal from a small amount of tumor-derived cfDNAs. FIG. 1 is an example 100 of plasma cfDNA illustrating this point.

Example 100 includes normal plasma cfDNAs with β_(normal)=1. The example 100 also includes liver tumor-derived cfDNAs with β_(tumor)=0. The example 100 further includes plasma cfDNAs that are a mixture of 99% normal plasma cfDNAs and 1% liver tumor-derived cfDNAs. Thus, the mixed plasma cfDNAs have β_(mixed)=0.99. Current technologies cannot reliably differentiate β_(mixed)=0.99 from β_(normal)=1.
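The arithmetic behind the mixed value can be written out explicitly. The worked equation below assumes the population-average β-value of a mixture is the fraction-weighted average of its components and reuses the symbol θ for the tumor-derived fraction (here 0.01); that use of θ in this example is an assumption made for illustration.

```latex
\beta_{\text{mixed}}
  = (1-\theta)\,\beta_{\text{normal}} + \theta\,\beta_{\text{tumor}}
  = 0.99 \times 1 + 0.01 \times 0
  = 0.99
```

Under this assumption the population average shifts by only θ·|β_(normal) − β_(tumor)| = 0.01, which is why a 1% tumor fraction is difficult to resolve from the average alone.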

The following embodiments of apparatuses and methods aim to provide solutions to reliably detect small amounts of tumor-derived cfDNAs, such as in the example 100 in FIG. 1.

In one embodiment, a computer program product includes a non-transitory computer-readable medium having instructions configured for detecting a cancer in a patient at a resolution of single reads, which, when executed by a processor of a computing system, cause the processor to perform the steps comprising: retrieving an N number of reads of the patient cell free DNA (cfDNA) methylation profile, N being a positive integer; identifying a J number of CpG clusters in the cfDNA methylation profile, J being a positive integer; retrieving a K number of DNA methylation markers of a cancer, K being a positive integer; determining the K number of marker regions in the cfDNA methylation profile, wherein the marker regions are the CpG clusters that correspond to the DNA methylation markers of the cancer; retrieving a T-class methylation pattern of each marker region, expressed as m_(k) ^(T), wherein m denotes marker region, T denotes T-class, k=1, 2, . . . K, wherein the T-class methylation pattern is a methylation pattern derived from cfDNAs of tumor cells of the cancer; retrieving an N-class methylation pattern of each marker region, expressed as m_(k) ^(N), wherein m denotes marker region, N denotes N-class, k=1, 2, . . . K, wherein the N-class methylation pattern is a methylation pattern derived from cfDNAs of normal cells; and calculating a burden θ based on the N number of reads of the cfDNA methylation profile, the K number of m_(k) ^(T), and K number of m_(k) ^(N).
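The step of determining which CpG clusters serve as marker regions can be illustrated with a small interval-matching sketch. The disclosure does not specify how a cluster "corresponds to" a marker; the sketch below assumes, for illustration only, that correspondence means genomic overlap on the same chromosome, and the names `Interval`, `overlaps`, and `marker_regions` are hypothetical.

```python
from typing import List, NamedTuple, Optional


class Interval(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # exclusive


def overlaps(a: Interval, b: Interval) -> bool:
    """True if two half-open intervals share at least one base."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def marker_regions(cpg_clusters: List[Interval],
                   markers: List[Interval]) -> List[Optional[Interval]]:
    """For each of the K markers, return the first overlapping CpG cluster.

    The result has one entry per marker (None if no cluster overlaps it),
    so index k lines up with the patterns m_k^T and m_k^N.
    """
    regions: List[Optional[Interval]] = []
    for marker in markers:
        match = next((c for c in cpg_clusters if overlaps(c, marker)), None)
        regions.append(match)
    return regions
```

Keeping the output aligned with the marker index k makes it straightforward to look up the T-class and N-class patterns for whichever marker region a read falls in.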

In another embodiment, there is an apparatus configured for detecting a cancer in a patient, comprising: a non-transitory memory; and a processor coupled to the non-transitory memory, the processor configured to execute steps of: retrieving an N number of reads of the patient cell free DNA (cfDNA) methylation profile, N being a positive integer; identifying a J number of CpG clusters in the cfDNA methylation profile, J being a positive integer; retrieving a K number of DNA methylation markers of a cancer, K being a positive integer; determining the K number of marker regions in the cfDNA methylation profile, wherein the marker regions are the CpG clusters that correspond to the DNA methylation markers of the cancer; retrieving a T-class methylation pattern of each marker region, expressed as m_(k) ^(T), wherein m denotes marker region, T denotes T-class, k=1, 2, . . . K, wherein the T-class methylation pattern is a methylation pattern derived from cfDNAs of tumor cells of the cancer; retrieving an N-class methylation pattern of each marker region, expressed as m_(k) ^(N), wherein m denotes marker region, N denotes N-class, k=1, 2, . . . K, wherein the N-class methylation pattern is a methylation pattern derived from cfDNAs of normal cells; and calculating a burden θ based on the N number of reads of the cfDNA methylation profile, the K number of m_(k) ^(T), and K number of m_(k) ^(N). In some embodiments, the apparatus is portable.

In another embodiment, there is a method comprising administering to the patient an effective amount of chemotherapy, radiation, or immunotherapy after the patient has been determined to have cancer based on a method comprising retrieving an N number of reads of the patient cell free DNA (cfDNA) methylation profile, N being a positive integer; identifying a J number of CpG clusters in the cfDNA methylation profile, J being a positive integer; retrieving a K number of DNA methylation markers of a cancer, K being a positive integer; determining the K number of marker regions in the cfDNA methylation profile, wherein the marker regions are the CpG clusters that correspond to the DNA methylation markers of the cancer; retrieving a T-class methylation pattern of each marker region, expressed as m_(k) ^(T), wherein m denotes marker region, T denotes T-class, k=1,2, . . . K, the T-class methylation pattern is a methylation pattern derived from cfDNAs of tumor cells of the cancer; retrieving an N-class methylation pattern of each marker region, expressed as m_(k) ^(N), wherein m denotes marker region, N denotes N-class, k=1,2, . . . K, wherein the N-class methylation pattern is a methylation pattern derived from cfDNAs of normal cells; and calculating a burden θ based on the N number of reads of the cfDNA methylation profile, the K number of m_(k) ^(T), and K number of m_(k) ^(N).

In another embodiment, there is a method for detecting a cancer in a patient based on a biological sample from the patient. In some embodiments, methods comprise using a computer program product that implements 1, 2, 3, 4, 5, 6 or more of the following steps: retrieving an N number of reads of the patient cell free DNA (cfDNA) methylation profile, N being a positive integer; identifying a J number of CpG clusters in the cfDNA methylation profile, J being a positive integer; retrieving a K number of DNA methylation markers of a cancer, K being a positive integer; determining the K number of marker regions in the cfDNA methylation profile, wherein the marker regions are the CpG clusters that correspond to the DNA methylation markers of the cancer; retrieving a T-class methylation pattern of each marker region, expressed as m_(k) ^(T), wherein m denotes marker region, T denotes T-class, k=1,2, . . . K, the T-class methylation pattern is a methylation pattern derived from cfDNAs of tumor cells of the cancer; retrieving an N-class methylation pattern of each marker region, expressed as m_(k) ^(N), wherein m denotes marker region, N denotes N-class, k=1,2, . . . K, wherein the N-class methylation pattern is a methylation pattern derived from cfDNAs of normal cells; and calculating a burden θ based on the N number of reads of the cfDNA methylation profile, the K number of m_(k) ^(T), and K number of m_(k) ^(N).

The foregoing has outlined rather broadly the features and technical advantages of the present invention in order that the detailed description of the invention that follows may be better understood. Additional features and advantages of the invention will be described hereinafter that form the subject of the claims of the invention. It should be appreciated by those skilled in the art that the concepts and specific embodiments disclosed may be readily utilized as a basis for modifying or designing other structures for carrying out the same purposes of the present invention. It should also be realized by those skilled in the art that such equivalent constructions do not depart from the spirit and scope of the invention as set forth in the appended claims. The novel features that are believed to be characteristic of the invention, both as to its organization and method of operation, together with further objects and advantages will be better understood from the following description when considered in connection with the accompanying figures. It is to be expressly understood, however, that each of the figures is provided for the purpose of illustration and description only and is not intended as a definition of the limits of the present invention.

In still further embodiments, the cancer is melanoma, or comprises cancerous cells that are or are from non-small cell lung, small-cell lung, lung, hepatocarcinoma, retinoblastoma, astrocytoma, glioblastoma, gum, tongue, leukemia, neuroblastoma, head, neck, breast, pancreatic, prostate, renal, bone, testicular, ovarian, mesothelioma, cervical, gastrointestinal, lymphoma, brain, colon, sarcoma or bladder. The cancer may include a tumor comprised of tumor cells.

In some embodiments, there are methods for treating cancer in a cancer patient comprising administering to the patient an effective amount of chemotherapy, radiation therapy, or immunotherapy (or a combination thereof) after the patient has been determined to have cancer based on methods disclosed herein. The point of origin of the cancer may be determined, in which case, the treatment is tailored to cancer of that origin. In some embodiments, tumor resection is performed as the treatment or may be part of the treatment with one of the other treatments.

In certain embodiments, there are methods of diagnosing a patient based on determining whether the patient has a methylation profile indicative of cancer. In some embodiments, methods involve generating a methylation profile that indicates whether the patient has cancer, and if so, from what organ. In certain embodiments, this is done using a biological sample from the patient that comprises cell free DNA.

Methods may further involve performing a biopsy, doing a CAT scan, doing a mammogram, performing ultrasound, or otherwise evaluating tissue suspected of being cancerous after determining the patient's methylation profile. In some embodiments, cancer that is found is classified in a cancer classification. Cancer classifications may be qualified as any of Stages I, II, III, or IV.

In some embodiments, methods may also involve comparing a measured value to a control that is indicative of a relevant noncancerous tissue or of relevant cancerous tissue. In certain embodiments, a measured value is compared to a predetermined threshold value. In some embodiments, that certain level or a predetermined threshold value is at, below, or above 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100 percentile, or any range derivable therein. Moreover, the value or control may be based on at least or at most 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 150, 200, 300, 400, 500 or more patients (or any range derivable therein).

In one aspect, provided herein is a computer program product comprising a computer-readable medium having computer program logic recorded thereon arranged to put into effect the method of an embodiment disclosed herein.

It is contemplated that any embodiment discussed in this specification can be implemented with respect to any method, system, kit, computer-readable medium, or apparatus of the invention, and vice versa. Furthermore, apparatuses of the invention can be used to achieve methods of the invention. Moreover, it is specifically contemplated that any embodiment discussed herein may be specifically excluded.

It will be understood by one of skill in the art that the embodiments disclosed herein can be combined in any combination.

BRIEF DESCRIPTION OF THE DRAWINGS

Those of skill in the art will understand that the drawings, described below, are for illustrative purposes only. The drawings are not intended to limit the scope of the present teachings in any way.

FIG. 1 depicts an exemplary embodiment, illustrating (A) the rationale of how “phased” methylation pattern based analysis is more sensitive for detecting tumor-derived cfDNAs than the analysis based on average methylation rates of individual CpG sites; and (B) the rationale of how “phased” methylation analysis can be naturally used for inferring cfDNA composition. Filled circles indicate methylated sites and empty circles indicate un-methylated sites.

FIG. 2 depicts an exemplary embodiment, illustrating the likelihood of identifying tissue-of-origin of cfDNA sequencing reads and their use for inferring normal tissue composition of plasma cfDNAs.

FIG. 3 depicts an exemplary embodiment, illustrating integrative analysis of “phased” cfDNA methylation. The black boxes with white text are components of the method, and the gray boxes are data.

FIG. 4 depicts an exemplary embodiment, illustrating how methylation signatures can be established based on averaged value across a population of tissue samples. Two models are used for characterizing methylation signature at different resolutions: bins (model 1) and CpG sites (model 2).

FIG. 5 depicts an exemplary embodiment, illustrating a process for obtaining plasma cfDNA methylation sequencing data.

FIG. 6 depicts an exemplary embodiment, illustrating a flowchart of an EM algorithm. For the new unknown tissue, its estimated methylation signature is m_(j) ^(new)=f(q, R), where f(q, R) is either Eq. (9) or Eq. (10), depending on which model of methylation signature is used.

FIG. 7 depicts an exemplary embodiment, illustrating how plasma cfDNA cancer and tissue compositions of a patient are used for cancer diagnosis. (A) Compute a Z-score to evaluate how far the fraction of a class t is from normal plasma samples; (B) Integrate the Z-score of each class's fraction into a prediction score.

FIG. 8 depicts an exemplary embodiment, illustrating healthy-or-cancer prediction performance (area under the curve: AUC) of simulation data with different tumor burdens in plasma cfDNAs.

FIG. 9 depicts an exemplary embodiment, illustrating prediction performance of using only a single dimension and both dimensions for two types of classification task: (i) binary classification, in which a new patient is classified as healthy or as having cancer; and (ii) multi-class classification, in which a new patient is classified as normal, liver cancer or lung cancer. We report the prediction performance (i.e., AUC and confusion matrix) averaged over different random test sets. In the confusion matrix, each entry value for a true class t and predicted class p is calculated as a fraction:

$\frac{\text{number of samples with true class label } t \text{ and predicted as class } p}{\text{number of all samples with true class label } t}.$

FIG. 10 depicts an exemplary embodiment, illustrating analysis of two liver cancer patients before and at multiple time points after surgical resection. The pie chart at each time point for two cancer patients is the composition of 14 tissues: A for Patient 1 and B for Patient 2. C lists the Z-score values of Patient 1 and Patient 2.

FIG. 11 depicts an exemplary embodiment, illustrating three models for characterizing DNA methylation pattern based on individual value of a tissue sample. Three models can be used for characterizing methylation signature at different resolutions (from high to low): (model 1) epialleles, (model 2) CpG sites, and (model 3) bins.

FIG. 12 is a flow chart of a method for screening cancer and identifying tissue-of-origin of a tumor, according to one embodiment.

FIG. 13 is a mixture model of methylation level (x) in a patient's plasma cfDNA, for different burdens of ctDNAs from the tumor type t, according to one embodiment.

FIG. 14A-B. A is a histogram of predicted ctDNA burdens for normal samples, according to one embodiment. B is a comparison of predicted and true ctDNA burdens for cancer samples, according to one embodiment.

FIG. 15 is a bar chart showing the prediction performance of an embodiment of the disclosure.

FIG. 16 shows the relationship between ctDNA burden and tumor tissue prediction for each plasma sample of the real data according to one embodiment of the disclosure.

FIG. 17 illustrates the data partition for learning discriminating features, in both simulation experiments and real data experiments according to one embodiment of the disclosure.

FIG. 18 illustrates a computer system for obtaining access to database files for detecting cancer and identifying tissue-of-origin according to one embodiment of the disclosure.

FIG. 19 illustrates a computer system configured for cancer detection and tissue-of-origin identification according to one embodiment of the disclosure.

FIG. 20 illustrates the rationale for why the methylation value averaged across all CpG sites in a sequencing read (α-value) is more sensitive at detecting tumor-derived cfDNAs than the traditional methylation level of a CpG site averaged across all reads (β-value). Each line represents a sequencing read and each dot represents a CpG site.

FIG. 21 is an overview of the CancerDetector method. The color of cfDNA sequencing reads represents their origin: red (green) reads are from tumor (normal plasma) cfDNA fragments.

FIG. 22 illustrates the calculation of the likelihood of a cfDNA sequencing read in a marker, given the methylation patterns of normal and tumor classes.

FIG. 23 shows predicted blood tumor burdens (averaged over 10 runs) for the liver cancer cfDNA samples, simulated by subsampling and mixing sequencing reads from a real healthy cfDNA sample (N1L or N2L) and a solid liver tumor sample (HCC1 or HCC2) at 8 different tumor burdens: 0, 0.1%, 0.3%, 0.5%, 0.8%, 1%, 3%, 5%, and at 3 different sequencing coverages (2×, 5×, and 10×). In each log-log plot, a blue point represents a simulated sample with error bars (standard deviation of predicted tumor burden), the x-axis is its true tumor burden and the y-axis is its predicted tumor burden. When the predicted tumor burden is out of range (>5%), we draw the point above the box.

FIG. 24 shows predicted blood tumor burdens for the real data in all 10 runs: (A) average ROC curve with standard deviation bars for CancerDetector, (B) average ROC curve with standard deviation bars for our previous method CancerLocator, and (C) the relationship between the tumor size and average blood tumor burden predicted by CancerDetector.

FIG. 25 shows average predicted blood tumor burdens for longitudinal data of two liver cancer patients before and after tumor resections in all 10 runs. The 2^(nd) patient passed away after surgery.

DETAILED DESCRIPTION OF EMBODIMENTS

Definitions

Unless otherwise noted, terms are to be understood according to conventional usage by those of ordinary skill in the relevant art.

In one aspect, disclosed herein are methods for disease diagnosis based on sequencing data (e.g., nucleic acid) from blood samples. As disclosed herein, sequencing data include information concerning cfDNA, cell-free microRNA, and immune repertoire sequencing. cfDNA is used throughout the application as the main example but it should not in any way limit the scope of the invention.

In one aspect, disclosed herein are methods and systems for establishing libraries of sequencing signatures (e.g., methylation signatures) based on existing sequencing methylation data (e.g., both array-based and conventional non-array based sequencing data). All suitable existing sequencing methylation data can be used here, including, for example, cfDNA, cell-free microRNA, and immune repertoire sequencing data. In some embodiments, the sequencing data can be disease specific (e.g., cancer specific) and/or tissue specific.

In one aspect, methods provided can also be used to identify tissue type and predict cancer prognosis and patient survival. Here, cfDNA is obtained from blood samples via minimally invasive procedures. In some embodiments, sequencing data from the cfDNA samples comprise sequencing reads each including a consecutive sequence of nucleic acids. In some embodiments, sequencing data from the cfDNA samples further comprise methylation status of selected sequences within the consecutive sequence. As disclosed herein, methylation status includes both the presence and absence of methylation modification at nucleic acid sites.

In one aspect, methods disclosed herein can also be used to monitor cancer treatment for identifying driver clones, driver genes and driver regulatory pathways.

In one aspect, methods provided can also be used to identify the composition of tissue types in plasma cfDNA, predict cancer diagnosis and prognosis, and identify the tumor type that a patient is likely to develop.

Among all plasma cfDNA-based cancer diagnosis methods, cfDNA methylation alterations have attracted considerable research interest, because (i) methylation alterations are hypothesized in the literature to be an early carcinogenesis mechanism and thus potentially early cancer indicators, and (ii) a large number of aberrant genome-wide DNA methylation patterns have been observed in plasma cfDNAs in recent studies. Almost all existing cfDNA-based methylation analysis methods are based on measurements of average methylation rates over individual CpG sites or genomic regions. The CpG sites or CG sites are regions of DNA where a cytosine nucleotide is followed by a guanine nucleotide in the linear sequence of bases along its 5′ to 3′ direction. CpG is shorthand for 5′-C-phosphate-G-3′, that is, cytosine and guanine separated by only one phosphate; phosphate links any two nucleosides together in DNA. As shown in the toy example of FIG. 1A (taking liver tumor as an example), however, the average methylation rates overlook the mixture nature of cfDNAs, in which the true cancer signals (tumor-derived cfDNAs or ctDNAs) are overwhelmed by normal plasma cfDNAs. Therefore, at an early cancer stage with a small fraction of ctDNAs, the average methylation rates of individual CpG sites in mixed cfDNAs, e.g., m_(mixed)=0.99 in FIG. 1A, differ negligibly from those of normal plasma cfDNAs, m_(normal)=1. Given measurement errors and biases, existing methods therefore often fail to sensitively detect tumor signals in early cancer patients. However, when we investigate each individual cfDNA fragment (which can be captured by methylation sequencing reads), the tumor-derived cfDNA fragments can be easily distinguished from normal plasma cfDNA fragments, even when the ctDNA fraction in plasma is only 1%. The rationale is depicted in FIGS. 1A and 1B.
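The contrast between site-averaged and read-level analysis can be illustrated with a small numerical sketch. The following Python snippet is illustrative only: it assumes a toy sample of 100 reads covering the same 5 CpG sites, in which normal plasma reads are fully methylated and the single tumor-derived read is fully unmethylated. The function names and the 0.5 cutoff are hypothetical and are not part of the disclosed method.

```python
# Toy sketch: site-averaged (unphased) versus read-level (phased) methylation.
# Assumption: each read covers the same 5 CpG sites; 1 = methylated, 0 = unmethylated.

def site_average(reads):
    """Average methylation rate of each CpG site across all reads."""
    n_sites = len(reads[0])
    return [sum(read[j] for read in reads) / len(reads) for j in range(n_sites)]

def read_average(read):
    """Average methylation across the CpG sites of a single read."""
    return sum(read) / len(read)

normal_reads = [[1, 1, 1, 1, 1] for _ in range(99)]  # assumed normal plasma pattern
tumor_reads = [[0, 0, 0, 0, 0] for _ in range(1)]    # assumed tumor pattern
mixed_reads = normal_reads + tumor_reads

# Unphased view: every site averages 0.99, barely different from the normal value of 1.
per_site_rates = site_average(mixed_reads)

# Phased view: the tumor-like read stands out individually (hypothetical cutoff 0.5).
tumor_like = [read for read in mixed_reads if read_average(read) < 0.5]
tumor_fraction = len(tumor_like) / len(mixed_reads)

print(per_site_rates, tumor_fraction)
```

In this toy setting the per-site averages are within measurement error of the normal value, while the read-level view recovers the 1% tumor fraction directly.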

The coincidence of methylation status of multiple adjacent CpG sites over an individual cfDNA fragment is the so-called “phased” methylation pattern of individual cfDNA fragments, or “epialleles” in the literature. In this study, we hypothesize that the “phased” methylation patterns on individual cfDNA fragments are more sensitive than average methylation rates of individual CpG sites or genomic bins. Several recent studies have recognized the sensitivity of “phased” methylation patterns in solid tumor samples for clonal evolution, intra-tumor phylogenies and purification. So far, no studies have applied “phased” methylation patterns to cfDNA-based cancer diagnosis. We are therefore the first to propose the “phased” cfDNA methylation analysis for cancer diagnosis. The genome-wide methylation sequencing of cfDNAs, e.g., whole-genome bisulfite sequencing (WGBS) or reduced representation bisulfite sequencing (RRBS), provides abundant “phased” methylation data for fueling our “phased” analysis method. That is, sequencing reads of WGBS or RRBS covering at least 3 CpG sites are “phased” methylation data. FIG. 1A reveals an intuitive “phased” methylation analysis method: we can label each cfDNA read with the source it is most likely to come from (normal plasma or liver tumor), then infer the fraction of reads most likely from liver tumor among all cfDNA reads. This is a cfDNA read categorization and two-class composition inference process, as formally illustrated in FIG. 1B. As a result, an elevated fraction of liver tumor derived cfDNA reads can imply a risk of liver cancer for a patient. If we repeat this process for other cancer types, we can not only determine whether a patient has a cancer, but also predict which cancer type the patient may have. The latter prediction can be made by choosing the cancer type that has the most abnormally elevated fraction in plasma cfDNAs.

Beyond de-convoluting cfDNAs into two classes (normal plasma and a specific tumor type), the fact that plasma cfDNAs are a mixture of DNAs released from normal tissues of various organs suggests another process: de-convoluting cfDNAs into a composition of multiple normal tissue types. This is a process of tissue-of-origin identification and tissue composition inference. Methylation data are ideal for this process because previous studies have provided rich evidence that methylation patterns contain abundant tissue-specific biomarkers^(4). As shown in FIG. 2, we follow the same procedure as above: first identifying the tissue-of-origin likelihood of each cfDNA read, then inferring the tissue composition of plasma cfDNAs. Sun et al. (2015) have demonstrated that the most abnormally elevated tissue fraction of plasma cfDNAs can be used as evidence for cancer type prediction^(4), although their method is based on the average methylation rates of genomic regions, not on “phased” methylation data.

The above two “phased” methylation pattern based cfDNA analyses explore the limited tumor signals, which are overwhelmed by massive tumor-irrelevant cfDNAs, from different biological dimensions and based on different genome regions. Our results showed that there is very little overlap (8%) between genomic regions of cancer-specific and tissue-specific methylation patterns. This observation motivates integrating these two largely non-overlapping analyses to jointly make a cancer diagnosis decision. In this work, we propose to perform the integrative analysis of “phased” cfDNA methylation for noninvasive cancer diagnosis.

Overall Process

An exemplary overview of the integrative analysis of “phased” cfDNA methylation is illustrated in FIG. 3. There are three components in this method: (i) establishing methylation signatures from public methylation data, (ii) a probabilistic framework for inferring cfDNA composition, and (iii) an integration method for cancer diagnosis. Processes can be established for determining tissue composition and tumor composition in the cfDNA samples.

In the first component, as much methylation data as possible is collected from public sources, such as the Cancer Genome Atlas (TCGA), the Gene Expression Omnibus (GEO) repository, Roadmap epigenomics, and other articles in the scientific literature. Cancer-specific or normal tissue-specific methylation signatures are established based on the existing methylation data to build a library of methylation signatures where each signature corresponds to a cancer type or tissue type.

In the second component, a patient's plasma cfDNAs are sequenced and his/her “phased” methylation data are obtained, for example, using the Whole Genome Bisulfite Sequencing (WGBS) method from Illumina or the Reduced Representation Bisulfite Sequencing (RRBS) method. Then, a “phased” methylation based probabilistic method is applied to infer the plasma cfDNA compositions with regard to two classes.

Using sequencing data of cancer patients, cancer-specific methylation signatures can be established. The cancer-specific methylation signatures can be used to detect the presence of a specific cancer type as well as determine the relative composition of normal plasma and a specific cancer type within a particular cfDNA sample.

Using sequencing data of normal or non-cancer patients, tissue-specific methylation signatures can be established. The tissue-specific methylation signatures can be used to detect the presence of a specific tissue type as well as determine the relative composition of different tissue types within a particular cfDNA sample.

In the third component, the different plasma cfDNA compositions inferred in the previous step are integrated to answer two questions: (i) is this patient healthy or cancerous? (ii) if he/she has cancer, where is the tumor?

Additional description and exemplary embodiments of each component are included in the following and the examples.

Methylation Signatures

The massive amount of publicly available methylation data, including WGBS, RRBS and array-based data, can be utilized to establish the cancer-specific and tissue-specific methylation signatures. A methylation signature includes at least two types of information: it denotes a genomic region and it represents the methylation status of the genomic region. In some embodiments, the methylation signatures also describe the inter-individual variance of methylation levels in a population of a tissue (FIG. 2) or tumor type (FIG. 1B).

For example, a two-step procedure is adopted to establish methylation signatures:

Step 1: Genomic regions that can differentiate cancer types (or normal tissue types), so-called “cancer-specific” methylation regions (or “tissue-specific” methylation regions), are identified. In some embodiments, genomic regions can be represented as a segment on a particular chromosome; for example, chromosome 11, nt 2,000 to nt 3,000. We can either use existing “differential methylation region” (DMR) detection methods, or design a simple scoring function that describes the differential power of each region, such as in the work by Sun et al. (2015).

Step 2: In each region identified in Step 1, the methylation signature is characterized at different genomic resolutions (CpG sites or genomic bins), for each class (i.e., a cancer type, a tissue type, or normal plasma) using a population of samples of this class. The choice of genomic resolution for a methylation signature depends on the information that public methylation data can provide.
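As a minimal sketch of the “simple scoring function” option in Step 1, the following Python snippet ranks candidate regions by how well their methylation rates separate a target class from all other classes. The score used here (gap between class means divided by the summed standard deviations) is a hypothetical choice made for illustration; it is not the scoring function of Sun et al. (2015), and the region names and rates are invented example values.

```python
from statistics import mean, pstdev

def region_score(target_rates, other_rates, eps=1e-6):
    """Hypothetical differential score: gap between class means scaled by spread."""
    gap = abs(mean(target_rates) - mean(other_rates))
    spread = pstdev(target_rates) + pstdev(other_rates) + eps
    return gap / spread

def select_regions(rates_by_region, target, top_k):
    """Return the top_k regions that best separate `target` from all other classes.

    rates_by_region maps region -> {class name -> list of per-sample methylation rates}.
    """
    scored = []
    for region, by_class in rates_by_region.items():
        others = [rate for name, rates in by_class.items()
                  if name != target for rate in rates]
        scored.append((region_score(by_class[target], others), region))
    scored.sort(reverse=True)
    return [region for _, region in scored[:top_k]]

# Invented example values: one discriminating region and one uninformative region.
example_rates = {
    "chr11:2000-3000": {"liver_tumor": [0.10, 0.15, 0.20],
                        "normal_plasma": [0.90, 0.85, 0.95]},
    "chr1:500-1500": {"liver_tumor": [0.50, 0.55, 0.60],
                      "normal_plasma": [0.50, 0.60, 0.55]},
}
selected = select_regions(example_rates, target="liver_tumor", top_k=1)
print(selected)
```

Any established DMR detection method could replace `region_score`; only the ranking-and-selection structure is meant to carry over.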

As noted above, methylation signatures comprise at least two types of information: genomic regions or locations and methylation status within the genomic regions or locations. Methylation signatures can be represented in multiple ways. Methylation signatures can be established at both population and individual levels. For example, at the population level, the Beta distribution modeling methylation rates of a set of individuals for a particular genomic region can be determined either at the bin level (model 1) or at individual CpG site level (model 2); see, FIG. 4. At the individual level, methylation rate can be determined from raw bisulfite sequencing data (model 1), at individual site level (model 2) and at bin level (model 3); see, FIG. 11.

It would be understood that a particular disease can correspond to multiple methylation signatures. In some embodiments, a disease corresponds to two methylation signatures; five or fewer methylation signatures; 10 or fewer methylation signatures; 15 or fewer methylation signatures; 20 or fewer methylation signatures; 50 or fewer methylation signatures; 100 or fewer methylation signatures; 150 or fewer methylation signatures; 200 or fewer methylation signatures; 250 or fewer methylation signatures; 500 or fewer methylation signatures; 750 or fewer methylation signatures; 1,000 or fewer methylation signatures; 1,500 or fewer methylation signatures; 2,000 or fewer methylation signatures; 3,000 or fewer methylation signatures; 4,000 or fewer methylation signatures; 5,000 or fewer methylation signatures; 7,500 or fewer methylation signatures; 10,000 or fewer methylation signatures; 15,000 or fewer methylation signatures; or 20,000 or fewer methylation signatures. In some embodiments, a disease corresponds to 20,000 or more methylation signatures.

In some embodiments, the methylation signatures are disease specific. In some embodiments, the methylation signatures are not disease specific. However, the methylation signatures vary significantly between diseases such that the variance can be used to detect the presence of a particular disease type.

In some embodiments, methylation data can be established at the bin level. See, for example, FIG. 11 model 3. The size of a bin is a pre-determined length of nucleic acid sequence. For example, a bin can include 10,000 nt or fewer; 5,000 nt or fewer; 2,500 nt or fewer; 1,500 nt or fewer; 1,000 nt or fewer; 800 nt or fewer; 600 nt or fewer; 500 nt or fewer; 400 nt or fewer; 300 nt or fewer; 200 nt or fewer; 100 nt or fewer; 50 nt or fewer; 40 nt or fewer; 20 nt or fewer; or 10 nt or fewer. In some embodiments, a bin can include 10,000 nt or more.

For example, for a class t of samples, each sample has a methylation rate m^(t) in the range [0,1] in the same bin. It is usually modeled by a Beta distribution m^(t) ~ Beta(α^(t), β^(t)).
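One simple way to obtain the two Beta parameters for a bin from a population of samples is the method of moments, sketched below in Python. This is an illustrative assumption: the text does not state which estimator is used, and the sample rates are invented example values. The function name `fit_beta_moments` is hypothetical.

```python
from statistics import mean, variance

def fit_beta_moments(rates):
    """Method-of-moments estimate of Beta shape parameters from rates in (0, 1).

    Illustrative estimator only; a maximum likelihood fit could be used instead.
    """
    m = mean(rates)
    v = variance(rates)  # sample variance across individuals of the class
    if v <= 0 or v >= m * (1 - m):
        raise ValueError("rates are not compatible with a Beta distribution")
    common = m * (1 - m) / v - 1
    return m * common, (1 - m) * common

# Invented methylation rates of one bin across five samples of the same class.
bin_rates = [0.82, 0.90, 0.76, 0.88, 0.85]
shape_a, shape_b = fit_beta_moments(bin_rates)
print(shape_a, shape_b)
```

By construction the fitted distribution has the same mean as the observed rates, shape_a / (shape_a + shape_b), while the sum shape_a + shape_b grows as the inter-individual variance shrinks.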

In some embodiments, methylation data can be established at individual CpG site level. See, for example, FIG. 11 model 2. For CpG site j of the class t, each sample has a methylation rate m_(j) ^(t), which is usually modeled by a Beta distribution m_(j) ^(t) ~ Beta(α_(j) ^(t), β_(j) ^(t)). In these embodiments, methylation status is provided for each CpG site for a given segment of nucleic acid sequence.

As disclosed herein, the symbol Ω^(t) is used to denote all methylation signatures for class t. Here, each class can be a disease type, a cancer type, a tissue type, etc.

Alternatively, a DNA methylation pattern can be defined based on raw bisulfite sequencing data and a frequency histogram. See, for example, FIG. 11 model 1.

Similar methods and systems can be used to calculate methylation patterns of a sample whose tissue type and disease type are unknown. For example, a blood sample can be taken from a patient and sequencing data can be obtained for cfDNA samples derived from the blood sample. Here, a methylation pattern obtained from an unknown sample also comprises two types of information: nucleic acid sequence and methylation status associated therewith.

However, as disclosed herein, sequencing data comprise sequencing reads. Sequencing reads are raw sequences derived from consecutive segments of nucleic acids. As such, methylation status of any sequencing read is “phased;” i.e., it represents the methylation status of only one of the alleles from the diploid chromosomal DNA. When methylation status is averaged over multiple reads, it becomes allele non-specific and can also be called “unphased.”

As disclosed herein, a sequencing read may include 1,000 nt or fewer; 800 nt or fewer; 600 nt or fewer; 500 nt or fewer; 400 nt or fewer; 300 nt or fewer; 200 nt or fewer; 100 nt or fewer; 50 nt or fewer; 40 nt or fewer; 20 nt or fewer; or 10 nt or fewer. In some embodiments, a sequencing read may include 1,000 nt or more (or any range derivable therein).

Analytical Framework

As described hereinabove, methylation signatures are established based on existing sequencing data from a specific tissue type or disease type (e.g., cancer type), forming libraries of methylation signatures. Each library is associated with a particular tissue type or disease type (e.g., a particular cancer type). These libraries can be used as standards for subsequent analysis.

For example, methylation patterns of a sequencing read can be compared with methylation signatures in one or more established libraries. Methylation patterns similar to established methylation signatures suggest that the cfDNA sample includes nucleic acid fragments from the particular tissue type, thus determining the composition of plasma cfDNA. Additionally, methylation patterns similar to established methylation signatures can also suggest that the cfDNA sample includes nucleic acid fragments relating to a particular disease such as liver cancer or lung cancer.

In some embodiments, multiple blood samples can be taken from the same patient over a period of time. Methylation patterns derived from these blood samples can be used to monitor disease onset for a high risk population. Alternatively, methylation patterns derived from these blood samples can be used to monitor disease progression; e.g., cancer prognosis.

In some embodiments, a probabilistic framework can be used to determine the relation between methylation patterns of unknown samples and established methylation signatures. Within the framework, either an exhaustive search of all possible solutions (called a grid search algorithm) or an expectation-maximization (EM) algorithm is applied. The EM algorithm is an iterative method for finding maximum likelihood or maximum a posteriori (MAP) estimates of parameters in statistical models, where the model depends on unobserved latent variables.
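The two search strategies can be sketched for the simplest two-class case. The Python snippet below is a minimal illustration, assuming that each read has already been assigned a likelihood under the tumor class and under the normal class, and that the only unknown is the tumor fraction θ. The latent variable is the unobserved class of each read. The function names, the input format and the likelihood values are assumptions made for illustration and do not reproduce the specific formulation of Example 2.

```python
import math

def log_likelihood(theta, read_likelihoods):
    """Log-likelihood of tumor fraction theta for a two-class mixture of reads."""
    return sum(math.log(max(theta * p_t + (1 - theta) * p_n, 1e-300))
               for p_t, p_n in read_likelihoods)

def grid_search_theta(read_likelihoods, steps=1000):
    """Exhaustive search over an evenly spaced grid of candidate fractions."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda theta: log_likelihood(theta, read_likelihoods))

def em_theta(read_likelihoods, theta=0.5, max_iter=500, tol=1e-10):
    """EM for the mixing fraction; the latent variable is each read's class."""
    for _ in range(max_iter):
        # E-step: posterior probability that each read is tumor-derived.
        weights = [theta * p_t / (theta * p_t + (1 - theta) * p_n)
                   for p_t, p_n in read_likelihoods]
        # M-step: the updated fraction is the mean posterior weight.
        new_theta = sum(weights) / len(weights)
        if abs(new_theta - theta) < tol:
            return new_theta
        theta = new_theta
    return theta

# Invented per-read likelihoods (tumor class, normal class): 2 tumor-like, 8 normal-like.
example_reads = [(0.9, 0.001)] * 2 + [(0.001, 0.9)] * 8
theta_em = em_theta(example_reads)
theta_grid = grid_search_theta(example_reads)
print(theta_em, theta_grid)
```

With a single unknown fraction the grid search is cheap and exact up to the grid step; EM becomes the more practical choice as the number of classes, and hence the dimension of the search space, grows.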

Exemplary embodiments of a probabilistic analysis are presented in Example 2. This probabilistic framework and EM algorithm are flexible, because we can (i) either add or remove a new unknown tissue type; and (ii) consider different models of tissue-specific methylation signatures, depending on the resolution of the methylation data that can be collected from public databases.

In some embodiments, personalized deconvolution of plasma cfDNAs can be achieved using the buffy coat sample of the same patient. We assume that the white blood cells in a patient's buffy coat sample release their DNA into the plasma. Therefore, in the normal tissue composition inference step, we can use methylation signatures of the patient's own white blood cells, instead of using methylation signatures of other people's white blood cells. It is expected that this may remove some inter-individual or germline variance.

In some embodiments, methods disclosed herein can be applied to analyze the composition and tissue origin of cfDNA. In some embodiments, changes in such compositions can be used to monitor the health of an individual. For example, the detection of cancerous nucleic acid material is an obvious warning sign, which warrants further tests and examinations. As another example, the liver DNA component in the cfDNA of a healthy individual may be within a certain range. A sudden rise of the liver DNA component in cfDNA samples may indicate altered health conditions. Similarly, a sudden decrease of a particular DNA component in the cfDNA sample of an individual may also suggest altered health conditions.

In some embodiments, methods disclosed herein can be applied to analyze the composition of blood cells such as white blood cells. Recent studies reported that an abnormal composition of different white blood cell types in the buffy coat sample of a patient is also a disease indicator. In particular, peripheral blood immune cell methylation profiles are reported to be associated with non-hematopoietic cancers. Therefore, we can apply our method to this problem and integrate it as a third dimension for cancer diagnosis.

cfDNA Composition and Disease Diagnosis and Prognosis

After the composition and tissue origin of cfDNA are determined, the information can be integrated for disease diagnosis and prognosis. For example, plasma cfDNA cancer compositions and tissue compositions can be integrated for cancer diagnosis. These two kinds of compositions can be used as two features of any patient. Ideally, supposing we have plasma cfDNA samples collected from a large number of healthy people and from patients with T cancer types, a naïve Bayes classifier can be trained on these two-feature data to diagnose whether or not a new patient has cancer and, if so, which cancer type.

However, due to the limited number of plasma samples, it is impossible to train a naïve Bayes classifier for prediction on real data. Therefore, we designed a simple diagnosis method based on Z-scores for small sample sizes. The intuition is that the more elevated the fraction of class t in the new patient is relative to that in normal people, the more likely it is that this patient has cancer, and the cancer type is the class t with the most elevated fraction. As shown in FIG. 7, we assume that we have obtained the empirical distribution of each tissue's (or tumor's) fraction in a population of normal plasma samples. Therefore, for a new patient, we can first perform Step 2 to obtain his/her fraction (denoted as x) of tissue (or tumor) t, then calculate the Z-score Z_(t) (or Z_(t)′) to evaluate how far x is from the normal plasma population:

$Z_{t} = \frac{x - \overline{x}_{normal}}{\sigma_{normal}}$

where $\overline{x}_{normal}$ ($\sigma_{normal}$) is the average (standard deviation) of the fraction of tissue (or tumor) t in the normal plasma population. The prediction score is then the maximum of Z_(t)+Z_(t)′ over all tissues t=1, . . . , T. This score is intuitive, because the patient whose tissue and tumor fractions for class t in his/her plasma cfDNA deviate the most from the normal population is most likely to have cancer type t. If

$\max\limits_{t}\left( {Z_{t} + Z_{t}^{\prime}} \right)$

falls into the normal range, then this patient is likely to be healthy.
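The Z-score procedure above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the dictionary layout of the inputs, and the use of the sample standard deviation are hypothetical choices, and the cutoff that defines the "normal range" is left to the caller.

```python
from statistics import mean, stdev

def z_score(x, normal_fractions):
    """Z-score of a patient's fraction x against the normal population.

    Needs at least two normal values with non-zero spread.
    """
    return (x - mean(normal_fractions)) / stdev(normal_fractions)

def predict_by_z_scores(patient, normal_population):
    """Return (best class t, max over t of Z_t + Z_t').

    patient[t] = (tissue fraction, tumor fraction) of the new patient.
    normal_population[t] = (tissue fractions, tumor fractions) observed
    in normal plasma samples, each given as a list.
    """
    scores = {}
    for t, (tissue_x, tumor_x) in patient.items():
        tissue_ref, tumor_ref = normal_population[t]
        scores[t] = z_score(tissue_x, tissue_ref) + z_score(tumor_x, tumor_ref)
    best = max(scores, key=scores.get)
    return best, scores[best]
```

A caller would compare the returned score with a cutoff chosen from the normal population and report the returned class only when the score exceeds it.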

Computer System and Program Product

The method disclosed herein can be implemented as a computer system and/or a computer program product that comprises a computer program mechanism embedded in a computer readable storage medium. Further, any of the methods of the present invention can be implemented in one or more computers or computer systems. Further still, any of the methods of the present invention can be implemented in one or more computer program products. Some embodiments of the present invention provide a computer system or a computer program product that encodes or has instructions for performing any or all of the methods disclosed herein. Such methods/instructions can be stored on a CD-ROM, DVD, magnetic disk storage product, or any other computer readable data or program storage product. Such methods can also be embedded in permanent storage, such as ROM, one or more programmable chips, or one or more application specific integrated circuits (ASICs). Such permanent storage can be localized in a server, 802.11 access point, 802.11 wireless bridge/station, repeater, router, mobile phone, or other electronic devices. Such methods encoded in the computer program product can also be distributed electronically, via the Internet or otherwise, by transmission of a computer data signal (in which the software modules are embedded) either digitally or on a carrier wave.

Some embodiments of the present invention provide a computer system or a computer program product that contains any or all of the program modules as disclosed herein. These program modules can be stored on a CD-ROM, DVD, magnetic disk storage product, or any other computer readable data or program storage product. The program modules can also be embedded in permanent storage, such as ROM, one or more programmable chips, or one or more application specific integrated circuits (ASICs). Such permanent storage can be localized in a server, 802.11 access point, 802.11 wireless bridge/station, repeater, router, mobile phone, or other electronic devices. The software modules in the computer program product can also be distributed electronically, via the Internet or otherwise, by transmission of a computer data signal (in which the software modules are embedded) either digitally or on a carrier wave.

In some embodiments, the method disclosed herein can be implemented in a networked device selected from the group consisting of a desktop computer, a laptop computer, a cellular phone, a personal digital assistant (PDA), an iPod, a tablet, a mobile device equipped with a network device, a smart phone, a pager, a television, a media player, a digital video recorder (DVR), and any other networked devices.

Further Embodiments Related To CancerLocator

Cancer cells can often display aberrant DNA methylation patterns, such as hypermethylation of the promoter regions of tumor suppressor genes and pervasive hypomethylation of intergenic regions. Therefore, DNA methylation can be a target for cancer diagnosis in clinical practice. Hyper/hypo-methylated tumor DNA fragments can be released into the bloodstream via cell apoptosis or necrosis, where they become part of the circulating cell-free DNA (cfDNA) in plasma. Because it is non-invasive, cfDNA methylation profiling may be an effective strategy for general cancer screening.

Some embodiments may include plasma methylation biomarkers for various specific cancers. The differentially methylated marker genes can be identified by comparing methylation profile data from patients with a certain cancer type to healthy controls. With a variety of methylation profiles specific to different cancers being identified, the embodiments disclosed herein can detect many types of cancers and provide tumor location information for further specific clinical investigation based on a simple non-invasive liquid biopsy.

FIG. 12 is a flow chart of a method 100 for screening cancer and identifying tissue-of-origin of a tumor, according to one embodiment.

As shown in FIG. 12, the method 100 learns the informative features of different cancer types from the vast amount of The Cancer Genome Atlas (TCGA) DNA methylation data 110. The method then models the plasma cfDNAs 125 in cancer patients as a mixture of normal cfDNAs 130 and ctDNAs 120. Finally, given the genome-wide methylation profile derived from the cfDNA sample 130 of an unknown patient, the embodiments disclosed herein use the informative features 115 to estimate the fraction of ctDNA in the plasma 125, and the likelihood that the detected ctDNAs come from each tumor type 135. Based on those likelihoods, the embodiments disclosed herein make the final decision 135 on whether the patient has tumors, and if so, the location of the primary tumor.

The first step of the method 100 is to identify the informative features of normal plasma 105 and multiple tumor types from the massive TCGA database 110. The method 100 focuses on seven cancer types from five organs, namely breast, colon, kidney, liver, and lung, that are generally regarded as having a high level of blood circulation. In the second step, given the plasma cfDNA methylation profile of a patient, the method 100 uses those informative features to simultaneously detect cancer and locate its tissue of origin.

In the first step, the method selects CpG clusters as features if their methylation range (MR) is sufficiently large 115. CpG sites are regions of DNA where a cytosine nucleotide is followed by a guanine nucleotide in the linear sequence of bases along its 5′→3′ direction. CpG sites are grouped into CpG clusters. MR is defined as the range of average methylation levels observed in healthy plasma and different solid tumor tissues.

The embodiments disclosed herein group the individual CpG sites into CpG clusters in order to use more mappable reads. For a CpG site covered by a probe on the microarray, the embodiment may define the region 100 bp (base pair) upstream and downstream as its flanking region, and assume that all CpG sites located within this region have the same average methylation level as the CpG sites covered by probes. Two adjacent CpG sites are grouped into one CpG cluster if their flanking regions overlap. Finally, only those CpG clusters containing at least 3 CpG sites covered by microarray probes are used in the embodiments.
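The grouping rule just described can be sketched as follows. This is a minimal illustration, assuming that probe positions are given as integer coordinates on a single chromosome and that two flanking regions count as overlapping when the sites are at most twice the flank length apart (the inclusive boundary is an assumption). The function name `build_cpg_clusters` is hypothetical.

```python
def build_cpg_clusters(probe_positions, flank=100, min_probes=3):
    """Group probe-covered CpG positions (one chromosome) into clusters."""
    clusters, current = [], []
    for pos in sorted(probe_positions):
        # Flanking regions [p - flank, p + flank] of two neighbours overlap
        # when the sites are at most 2 * flank apart (assumed inclusive).
        if current and pos - current[-1] > 2 * flank:
            clusters.append(current)
            current = []
        current.append(pos)
    if current:
        clusters.append(current)
    # Keep only clusters with enough probe-covered CpG sites
    return [c for c in clusters if len(c) >= min_probes]
```

For example, positions 100, 250 and 400 form one retained cluster, whereas a pair at 1000 and 1100 and an isolated site at 5000 are discarded because they contain fewer than three probe-covered sites.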

In other cases, embodiments disclosed herein generally choose the size of the flanking region and the number of CpGs in a cluster according to three criteria: (i) at least three CpG sites (in the microarray data) are included to obtain a robust measurement of methylation values in the solid tumor samples; (ii) the cluster is reasonably sized, so that there are sufficient CpG sites to calculate the methylation values, even when low coverage sequencing data is used; (iii) as many clusters as possible fall within a single type of genomic region (either CpG islands or shores).

In some embodiments, this procedure yields 42,374 CpG clusters, which together include about one half of all the CpG sites on the Infinium HumanMethylation450 microarray. Most of those clusters are each associated with only one gene. These CpG clusters are used for subsequent feature selection. In other embodiments, the procedure may yield from at or around 40,000 to at or around 50,000 CpG clusters. In other embodiments, the procedure may yield from at or around 30,000 to at or around 60,000 CpG clusters. In other embodiments, the procedure may yield from at or around 20,000 to at or around 70,000 CpG clusters. In other embodiments, the procedure may yield from at or around 10,000 to at or around 90,000 CpG clusters. In other embodiments, the procedure may yield from at or around 5,000 to at or around 100,000 CpG clusters.

The method 100 selected, on average, a total of K=14,429 CpG clusters (features) whose MR is no less than the threshold of 0.25.
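A minimal sketch of MR-based feature selection is given below. It assumes that each class (normal plasma or one tumor type) is supplied as a list of samples, each sample being a list of K cluster methylation levels; the function name `select_features_by_mr` and this layout are illustrative assumptions.

```python
def select_features_by_mr(levels_by_class, threshold=0.25):
    """Return indices of CpG clusters whose methylation range >= threshold.

    levels_by_class maps a class label (normal plasma or a tumor type) to
    a list of samples; each sample is a list of K methylation levels.
    """
    class_means = []
    for samples in levels_by_class.values():
        n = len(samples)
        # Average methylation level of every cluster within this class
        class_means.append([sum(col) / n for col in zip(*samples)])
    selected = []
    for k, per_class in enumerate(zip(*class_means)):
        mr = max(per_class) - min(per_class)  # range of the class averages
        if mr >= threshold:
            selected.append(k)
    return selected
```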

In other embodiments, the total number K of selected CpG clusters is from 14000 to 15000. In other embodiments, the total number K of selected CpG clusters is from 13000 to 16000. In other embodiments, the total number K of selected CpG clusters is from 12000 to 17000. In other embodiments, the total number K of selected CpG clusters is from 10000 to 18000. In other embodiments, the total number K of selected CpG clusters is from 8000 to 20000. In other embodiments, the total number K of selected CpG clusters is from 6000 to 22000. In other embodiments, the total number K of selected CpG clusters is from 5000 to 30000.

In other embodiments, the MR threshold can be equal to or around 0.2 to equal to or around 0.3. In other embodiments, the MR threshold can be equal to or around 0.1 to equal to or around 0.5. In other embodiments, the MR threshold can be equal to or around 0.05 to equal to or around 0.7. In other embodiments, the MR threshold can be equal to or around 0.01 to equal to or around 0.9. In further embodiments, the MR threshold is, is at least, or is at most 1, 2, 3, 4, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, 175, 180, 185, 190, 195, 200, 205, 210, 215, 220, 225, 230, 235, 240, 245, 250, 300, 400, 500, 600, 700, 800, 900, 1000 or any range derivable therein.

For each CpG cluster, the method takes into account its variation across individuals by modeling the distribution of methylation levels for the same tumor type (or normal plasma) as a beta distribution Beta(α_(t), β_(t)), as shown in 120. The index t=0 represents normal plasma, while t=1, . . . , T represents a tumor type.

In the second step, the method 100 uses the selected features (the selected CpG clusters) and their beta distributions to deconvolute a patient's plasma cfDNA methylation profile 130 into the normal plasma cfDNA distribution and, if applicable, a solid tumor DNA distribution at 135. The method 100 may include a probabilistic method that can simultaneously infer the burden and the tissue of origin of the ctDNA. At 135, if the likelihood of presence is not substantially higher for any tumor type than the likelihood that the observed distribution arises from a normal background, the patient is predicted to be noncancerous. Otherwise, if the likelihood of presence for any tumor type is substantially higher than that of a normal background, the patient is predicted to have the tumor type that is associated with the highest likelihood.

In one embodiment, the issue of determining the ctDNA burden θ and tumor type t given a patient's cfDNA methylation profile X can be formulated as a maximum-likelihood estimation problem with likelihood function L(θ, t|X), wherein the likelihood function is expressed as the product of the likelihoods of each selected CpG cluster, assuming that all of the K selected CpG clusters are independent from each other. This is expressed as L(θ, t|X)=Π_(k=1)^(K) L(θ, t|x_(k)), where x_(k) denotes the methylation level of selected CpG cluster k in a cancer patient's cfDNA methylation profile X. In principle, x_(k) can be a linear combination of the DNA methylation level of normal plasma and the DNA methylation level of a solid tumor type t with fraction θ. The normal and tumor components of the methylation are denoted by v_(k) and u_(k), as shown in FIG. 13. That is, x=(1−θ)v+θu (for simplicity, the subscript k is omitted from this notation). As mentioned earlier, v and u follow the Beta distributions Beta(α₀, β₀) and Beta(α_(t), β_(t)), respectively, wherein t=1, . . . , T denotes a tumor type and t=0 denotes the cancer-free profile. Therefore, x follows the distribution ψ(θ, t), which is calculated as the convolution of the two Beta distributions Beta(α₀, β₀) and Beta(α_(t), β_(t)).

In some embodiments, because a patient's plasma may provide only a low amount of cfDNA, the methylation profile of the patient's plasma is usually measured by sequencing-based methods. Therefore, the methylation level x_(k) of CpG cluster k can be derived from two numbers, n_(k) and m_(k), where n_(k) denotes the total number of cytosines available in the CpG cluster k and m_(k) denotes the number of methylated cytosines in the CpG cluster k. The embodiments can model m_(k) and n_(k) together with a binomial distribution m_(k)˜Binomial(n_(k), x_(k)), and the likelihood function can be rewritten as L(θ, t|M, N)=Π_(k=1)^(K) L(θ, t|m_(k), n_(k)). The detailed description of this method is expanded in the following items.

In some embodiments, a mixture model of methylation levels of plasma cfDNAs can be used. For example, the cfDNAs in the plasma of cancer patients can be regarded as a mixture of normal background DNAs and tumor-released DNAs. Formally, for each CpG cluster k∈{1, 2, . . . , K}, the methylation level x_(k) of the plasma cfDNA from a given patient can be approximated as a mixture of v_(k) and u_(k), where v_(k) denotes the methylation level of the normal plasma sample and u_(k) denotes the methylation level of the solid tumor tissue. Let θ∈(0,1), wherein θ denotes the proportion of tumor-derived DNAs in plasma cfDNA. Then x_(k) can be expressed as the weighted sum of v_(k) and u_(k), i.e., x_(k)=(1−θ)v_(k)+θu_(k).

The embodiments may assume that an individual carries at most one type of tumor among the T possible tumor types. Let t∈{0, 1, 2, . . . , T} be the variable representing either normal plasma (t=0) or a tumor type (1≤t≤T). For each CpG cluster k, the embodiments may model its methylation level in a sample of type t as a Beta distribution: v_(k)˜Beta(α_(k0), β_(k0)) for normal plasma samples (t=0) and u_(k)˜Beta(α_(kt), β_(kt)) for solid tumor samples of type t∈{1, . . . , T}, where α_(k0) and β_(k0) (or α_(kt) and β_(kt)) are the parameters of the beta model of methylation levels of CpG cluster k in normal plasma (solid tumor) samples. As illustrated in Step 1 of FIG. 12, the parameters of these Beta distributions are estimated by the method of moments, using the large amount of public tumor data and normal plasma data.
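For reference, the standard method-of-moments estimator for a Beta distribution matches the sample mean and variance to the Beta mean α/(α+β) and variance αβ/((α+β)²(α+β+1)). A minimal sketch for one CpG cluster and one sample type follows; the function name `fit_beta_by_moments` and the use of the sample (n−1) variance are assumptions made for illustration.

```python
from statistics import mean, variance

def fit_beta_by_moments(levels):
    """Method-of-moments (alpha, beta) for methylation levels in (0, 1)."""
    m = mean(levels)
    v = variance(levels)  # sample variance; needs at least two values
    if not 0.0 < v < m * (1.0 - m):
        raise ValueError("moments are incompatible with a Beta distribution")
    common = m * (1.0 - m) / v - 1.0
    return m * common, (1.0 - m) * common
```

Calling the function once per CpG cluster k and per sample type t would yield the parameter pairs (α_(kt), β_(kt)) used by the later likelihood calculations.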

In some embodiments, by integrating the two Beta distributions (v_(k) and u_(k)), as shown in FIG. 13, x_(k) can be modeled by a derived distribution with the given ctDNA burden θ and source tumor type t. This model is denoted as the probability density function ψ(x_(k)|θ, t), which is calculated by the convolution of Beta(α_(k0), β_(k0)) and Beta(α_(kt), β_(kt)). It is formally expressed below:

$\psi\left(x_{k} \mid \theta, t\right) = \int_{0}^{1} f_{Beta}\left(\frac{x_{k} - \theta u_{k}}{1 - \theta} \,\middle|\, \alpha_{k0}, \beta_{k0}\right) f_{Beta}\left(u_{k} \mid \alpha_{kt}, \beta_{kt}\right) du_{k} \qquad \text{Eq. (1)}$

where f_(Beta) is the probability density function of the Beta distribution.

Some embodiments model the methylated cytosine count of plasma cfDNA sequencing data. In these embodiments, due to its low abundance in plasma, the methylation profile of cfDNA is usually measured by sequencing-based methods, and the methylation level (x_(k)) of a CpG cluster k can be characterized by the numbers of methylated and unmethylated cytosines on the reads. Let M=(m₁, m₂, . . . , m_(K)) be the numbers of methylated cytosines and N=(n₁, n₂, . . . , n_(K)) be the total numbers of cytosines mapped to the CpG sites of each cluster, where the index runs over all K CpG clusters. For each CpG cluster k, m_(k) can be modeled by a binomial distribution: m_(k)˜Binomial(n_(k), x_(k)). By integrating the mixture model of x_(k) in Eq. (1), the embodiments have the likelihood function for each CpG cluster k, which takes as inputs the model parameters (θ, t, α_(k0) and β_(k0), α_(kt) and β_(kt)) and the sequence measurements of plasma samples (m_(k), n_(k)):

$f\left(m_{k} \mid \theta, t, n_{k}\right) = \int_{0}^{1} f_{Binomial}\left(m_{k} \mid n_{k}, x_{k}\right) \psi\left(x_{k} \mid \theta, t\right) dx_{k} \qquad \text{Eq. (2)}$

where f_(Binomial) is the probability mass function of the binomial distribution.
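Eqs. (1) and (2) can be evaluated numerically, for example with the composite Simpson's rule that the description names for these integrals. The sketch below is one possible illustration and not a definitive implementation: the function names, the number of integration steps, and the restriction to Beta parameters greater than 1 (so that the density is zero at the endpoints) are all assumptions.

```python
from math import comb, exp, lgamma, log

def beta_pdf(x, a, b):
    """Beta density; returns 0 outside (0, 1) (sketch assumes a, b > 1)."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    log_norm = lgamma(a + b) - lgamma(a) - lgamma(b)
    return exp(log_norm + (a - 1.0) * log(x) + (b - 1.0) * log(1.0 - x))

def simpson(f, lo, hi, steps=100):
    """Composite Simpson's rule with an even number of sub-intervals."""
    h = (hi - lo) / steps
    total = f(lo) + f(hi)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f(lo + i * h)
    return total * h / 3.0

def psi(x, theta, a0, b0, at, bt):
    """Eq. (1) exactly as written, for a ctDNA burden theta < 1.

    No change-of-variables factor 1 / (1 - theta) is applied, because
    Eq. (1) does not show one.
    """
    def integrand(u):
        v = (x - theta * u) / (1.0 - theta)  # implied normal-plasma level
        return beta_pdf(v, a0, b0) * beta_pdf(u, at, bt)
    return simpson(integrand, 0.0, 1.0)

def cluster_likelihood(m, n, theta, a0, b0, at, bt):
    """Eq. (2): likelihood of m methylated cytosines out of n."""
    def integrand(x):
        binom = comb(n, m) * x ** m * (1.0 - x) ** (n - m)
        return binom * psi(x, theta, a0, b0, at, bt)
    return simpson(integrand, 0.0, 1.0)
```

At θ=0 the mixture reduces to the normal-plasma Beta distribution, so `cluster_likelihood` should agree with the beta-binomial probability for Beta(α_(k0), β_(k0)), which provides a convenient sanity check.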

Some embodiments utilize maximum likelihood L to estimate blood tumor burden and type, e.g., steps 135 and 140 as shown in FIG. 12. In these embodiments, given the methylation sequencing profile of a patient's plasma cfDNA sample, from which the vectors M and N can be derived as previously disclosed, the embodiments aim to find the maximum likelihood estimate of two model parameters: (1) this specific sample's cfDNA tumor burden θ and (2) its source tumor type t. To integrate the mixture models of multiple markers into the formulation, the embodiments adopt an assumption: all features or markers are independent from each other. This assumption has been widely used in a number of cell-type deconvolution studies. Under this assumption, the log-likelihood can be written as:

$\log L\left(\theta, t \mid M, N\right) = \sum_{k=1}^{K} \log f\left(m_{k} \mid \theta, t, n_{k}\right) \qquad \text{Eq. (3)}$

Since the integrals in Eqs. (1)-(2) cannot be easily solved analytically, the embodiments use Simpson's rule to evaluate them numerically when calculating the log-likelihood log L(θ, t|M, N). To maximize the log-likelihood over θ, a set of J predefined θ values,

${\Theta = \left\{ {0,\frac{1}{J},\frac{2}{J},\ldots,\frac{J - 1}{J}} \right\}},$

is used to conduct a grid search for the best estimate (i.e., a globally optimal solution over the grid). The higher the resolution (J), the more precise the estimation. After obtaining the solution (i.e., the optimized $\hat{\theta}$ and the optimized $\hat{t}$) that maximizes Eq. (3), the embodiments disclosed herein use the estimated parameters to calculate a simple yet effective prediction score that answers two questions: “Does the patient have cancer?” and “If the patient has cancer, which tumor type is it?” This prediction score is defined below:

$\lambda = \frac{1}{K}\left[\log L\left(\hat{\theta}, \hat{t} \mid M, N\right) - \log L\left(\theta = 0 \mid M, N\right)\right] \qquad \text{Eq. (4)}$

where the denominator K is used to normalize the log-likelihood, so that λ is comparable when using a different number of features. The variable t is not included in L(θ=0|M, N) because θ=0 indicates a normal plasma sample. The larger the prediction score λ, the higher the chance that the patient has a tumor of type $\hat{t}$. Specifically, if λ exceeds a threshold, the patient is predicted to have cancer with the ctDNA burden $\hat{\theta}$ and the tumor type $\hat{t}$; otherwise, he/she is classified as noncancerous.
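The grid search over Θ and the prediction score λ can be sketched as below. To keep the example self-contained, the per-cluster term log f(m_(k)|θ, t, n_(k)) is passed in as a callable; the toy callable used here replaces the Beta distributions with fixed methylation levels and is purely hypothetical, as are the function names and the threshold value.

```python
from math import comb, log

def grid_search(M, N, tumor_types, log_f, J=100):
    """Maximize Eq. (3) over theta in {0, 1/J, ..., (J-1)/J} and t.

    log_f(k, m, n, theta, t) must return log f(m_k | theta, t, n_k);
    it stands in for Eq. (2) and is supplied by the caller.
    Returns (theta_hat, t_hat, lambda), with lambda as the normalized
    log-likelihood difference against theta = 0.
    """
    K = len(M)

    def log_lik(theta, t):
        return sum(log_f(k, M[k], N[k], theta, t) for k in range(K))

    # At theta = 0 the sample is pure normal plasma, so t is irrelevant.
    null = log_lik(0.0, tumor_types[0])
    best_ll, theta_hat, t_hat = null, 0.0, None
    for t in tumor_types:
        for j in range(1, J):
            ll = log_lik(j / J, t)
            if ll > best_ll:
                best_ll, theta_hat, t_hat = ll, j / J, t
    return theta_hat, t_hat, (best_ll - null) / K

def classify(lam, t_hat, threshold):
    """Report tumor type t_hat only if lambda exceeds the threshold."""
    return t_hat if (t_hat is not None and lam > threshold) else "non-cancer"

# Toy stand-in for Eq. (2): fixed levels instead of Beta distributions.
NORMAL = [0.1, 0.9, 0.5]
TUMOR = {1: [0.9, 0.1, 0.5], 2: [0.1, 0.9, 0.9]}

def toy_log_f(k, m, n, theta, t):
    x = (1.0 - theta) * NORMAL[k] + theta * TUMOR[t][k]
    return log(comb(n, m)) + m * log(x) + (n - m) * log(1.0 - x)

theta_hat, t_hat, lam = grid_search([50, 50, 50], [100, 100, 100],
                                    [1, 2], toy_log_f, J=10)
label = classify(lam, t_hat, threshold=0.1)
```

In this toy case the observed counts correspond to an equal mix of the normal profile and tumor type 1, so the search is expected to recover that type with a burden of one half.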

In some embodiments, in establishing the prediction model, the methylation data of a simulated plasma cfDNA sample is generated by computationally mixing the entire methylation profiles of a normal plasma cfDNA sample and a solid tumor sample (e.g., breast, colon, kidney, liver, or lung tumors), at a variety of ctDNA burdens (θ values). This strategy allows the simulated methylation data to reflect the potential correlations of methylation values between CpG clusters in real data. In some embodiments, when establishing the prediction model, tumor copy number aberration (CNA) events are added at pre-defined probabilities (10%, 30% and 50% across all CpG clusters).
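A simplified sketch of such a computational mixing step is shown below. It mixes two per-cluster methylation profiles at a given θ and then draws methylated read counts from a binomial distribution. The read-count sampling, the caller-chosen depths, the fixed seed, and the function name are assumptions for illustration, and CNA events are not modeled.

```python
import random

def simulate_plasma_sample(normal_levels, tumor_levels, theta, depths, seed=0):
    """Mix two methylation profiles and draw read counts for each cluster.

    normal_levels[k], tumor_levels[k]: methylation levels of cluster k.
    depths[k]: total cytosines n_k (an assumed, caller-chosen coverage).
    Returns (mixed levels x_k, methylated counts m_k).
    """
    rng = random.Random(seed)
    # x_k = (1 - theta) * v_k + theta * u_k for every cluster
    mixed = [(1.0 - theta) * v + theta * u
             for v, u in zip(normal_levels, tumor_levels)]
    # m_k ~ Binomial(n_k, x_k), drawn as n_k Bernoulli trials
    counts = [sum(rng.random() < x for _ in range(n))
              for x, n in zip(mixed, depths)]
    return mixed, counts
```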

In some embodiments, the prediction model is evaluated on simulation data with known ctDNA fractions. The results show that the prediction model can achieve a Pearson's correlation coefficient (PCC) of 0.975 between the predicted and true proportions of ctDNA, and an error rate of 0.074 for the classification of non-cancer and tumor types. Moreover, the prediction model performed well when the proportion of tumor-derived DNAs in the cfDNAs is lower than 50%, which in reality typically corresponds to low CNA. The embodiments of this disclosure achieved promising prediction results on patient plasma samples, including cancer samples collected from early-stage cancer patients.

As shown in FIG. 14A, the majority (87.9%) of the estimated ctDNA burdens for the normal samples are not more than 0.02, and none of them is greater than 0.05. Note that whether a sample is from a cancer patient or not is determined by the optimal likelihood calculated in the prediction model, not by the predicted ctDNA burden. The prediction results for the simulated cancer-patient plasma samples are shown in FIG. 14B.

As shown in FIGS. 14A and 14B, the results show that the variance of the predicted ctDNA burdens (θ) increases with the true θ, implying that the burden estimation becomes less precise when patients are in mid or late cancer stages. This result could be partially explained by the fact that late-stage tumor samples may have higher tumor heterogeneity, which complicates ctDNA burden prediction. However, this increased variance does not reduce the performance of the cancer detection, because the predicted θ is still much higher than the normal background. Indeed, as demonstrated in FIG. 14B, the prediction of cancer tissue-of-origin of ctDNA becomes more distinguishable with high ctDNA burden, despite the increased variance in ctDNA prediction. Note that in FIG. 14B, cyan and red circles represent correct and incorrect predictions, respectively.

In FIG. 15, the embodiments of the prediction model are further evaluated. For a systematic comparison, the embodiments divide the simulation data into 10 subsets for different cancer stages, each of which includes 200 normal plasma samples and 200 cancer plasma samples of each tumor type. The different cancer stages (from early, mid to late stages) are represented by a set of ctDNA burden ranges (θ, θ+10%] as the x axis, where θ=0, 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%. For a 6-class (t=0, 1, . . . , 5) cancer classification problem (normal, breast, colon, kidney, liver, and lung), the embodiments in FIG. 15 adopt the error rate measure for assessing the classification performance. As shown in FIG. 15, for early-stage cancer patients with ctDNA burdens in the range θ∈(0, 10%], the prediction model has an error rate of 0.240. For the second lowest ctDNA burdens θ∈(10%, 20%], the prediction model reaches a low error rate of 0.067. The results are notable, because the embodiments of the prediction model perform well with low ctDNA fractions, highlighting the usefulness of the embodiments in screening early stage cancers.

As shown in FIG. 17, the embodiments randomly assign 75% of solid tumor samples 605 and 75% of healthy plasma cfDNA samples 615 to the training set to establish the model parameters. The remaining 25% of the healthy plasma samples 620 and the remaining 25% of the solid tumor samples 610 form the simulation data set. The remaining 25% of the healthy plasma samples 620 and all the tumor ctDNA samples 625 collected from cancer patients form the testing set.
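A random 75%/25% partition of this kind can be sketched as follows; the function name `split_samples`, the fixed seed, and the rounding-down of the training size are illustrative assumptions.

```python
import random

def split_samples(samples, train_fraction=0.75, seed=0):
    """Randomly split samples into (training, held-out) lists."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]
```

Applying the function separately to the solid tumor samples and to the healthy plasma samples, with a different seed for each repetition, would give one training set and one held-out set per run.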

Table 1 shows the model prediction results for the procedure described in FIG. 17. After performing this procedure (including random data partition and predictions) ten times, the predictions of the embodiments are summarized into a confusion matrix, as shown in Table 1. For a new patient's plasma sample 625, the prediction model assumes no prior information about the cancer type. Therefore, the model considers colon and kidney tumors as possible results, even though our real plasma data in Table 1 does not include colon or kidney tumors. The results in Table 1 show that the model performed well. Specifically, the majority of the breast, liver, and lung cancer samples are accurately predicted by the cancer detection model. The cancer detection model obtains a low error rate of 0.265 for the 6-class prediction problem. The results in Table 1 are consistent with the simulation experiments for ctDNA burdens lower than 50% shown in FIGS. 14A and 14B.

To further explore the relationship between estimated ctDNA burdens and tumor types in real data, the inventors plot their relationship in FIG. 16 by summarizing predictions for each plasma sample in all ten runs: the average predicted ctDNA burden (y-axis value) and the most frequently predicted tumor type among ten runs for each sample. It can be observed that the higher the estimated ctDNA burden, the more accurate the prediction of tumor type. This is highly consistent with the results of simulation data shown in FIGS. 14A and 14B. As shown in FIG. 16, for the breast cancer samples, three out of five samples have ctDNA burdens ≤2.2%, and they are predicted as non-cancer. The predicted ctDNA burdens of the two correctly predicted samples are 5.0% and 18.0%, respectively, and the latter one is a metastatic sample. For the liver cancer samples, at least 25 of all samples are from early-stage (Barcelona Clinic Liver Cancer stage A) patients. A majority of them (80%) were correctly classified by the prediction model as liver cancer and all of them were detected as cancer samples. Compared to the breast cancer samples, most of the liver samples, even at early stage, can have moderate to high tumor burden (average predicted tumor burden of 14.9% and the highest reaching 59.0%), given that the liver generally has strong blood circulation, but we also correctly classified the one with only 2.0% predicted ctDNA burden as liver cancer. Among the twelve lung cancer samples (two samples do not have cancer stage information), at least five are collected from early-stage patients. These early-stage samples have predicted ctDNA burdens ranging from 2.0% to 4.0%. Among these five early-stage lung cancer samples, four have been correctly predicted as lung cancer, whereas the remaining one is predicted as non-cancer.

The model correctly predicted 7 out of 8 chronic hepatitis B virus (HBV) samples to be non-cancer samples. In addition, our method successfully predicted the single sample with a benign lung tumor as non-cancer in all ten runs, with the predicted ctDNA burden being 0.0%. These results demonstrate that the cancer detection model disclosed herein can go beyond distinguishing healthy samples from cancer samples and handle more sophisticated scenarios, such as differentiating hepatitis B virus carriers or benign tumor patients from cancer patients.

In addition to being non-invasive, blood-based cancer diagnosis, unlike traditional diagnosis based on tissue biopsy, has the potential to diagnose tumors from many organs. The prediction models disclosed herein aim to exploit this potential of cfDNA by not only diagnosing the presence of tumors, but also detecting the tissue of origin. The embodiments disclosed herein lay out a systematic prediction method for cfDNA-based cancer type inference, and comprehensively evaluate its performance on both simulated data and real data. The embodiments disclosed show accurate and useful predictions, especially in early stage cancer when the proportion of tumor-derived DNAs is lower than 50%. In addition, the embodiments show that the predictions are robust to CNA events, because the genome-wide features may outweigh the local aberrations.

In some embodiments, DNA methylation microarrays of solid tumor tissues were used to obtain the data to train the model, due to the scarcity of whole-genome bisulfite sequencing data in the public domain. Using DNA methylation microarray data is reasonable because it focuses on methylation in promoter regions. Nonetheless, it is expected that the growing amount of whole-genome bisulfite sequencing data will significantly empower the embodiments disclosed herein because it potentially provides higher resolution data.

In some embodiments, the methylation data are collected and processed in the following manner. The embodiments collect a large set of public methylation data of solid tumors and plasma cfDNA samples taken from both healthy people and cancer patients. The majority of tumor methylation profiles in the TCGA (The Cancer Genome Atlas) were assayed using DNA microarrays, e.g., the Infinium HumanMethylation450 microarray. The embodiments collect those data for solid tumors with >100 samples from five different organs: 681 samples of breast (BRCA), 290 samples of colon (COAD), 522 samples of kidney (including 300/156 samples of KIRC/KIRP), 169 samples of liver (LIHC), and 809 samples of lung (including 450/359 samples of LUAD/LUSC). In this paragraph, the abbreviations have the following meanings. BRCA: Breast Invasive Carcinoma; COAD: Colon Adenocarcinoma; KIRC: Kidney Renal Clear Cell Carcinoma; KIRP: Kidney Renal Papillary Cell Carcinoma; LIHC: Liver Hepatocellular Carcinoma; LUAD: Lung Adenocarcinoma; LUSC: Lung Squamous Cell Carcinoma.

In some embodiments, two datasets of whole-genome bisulfite sequencing (WGBS) data of plasma samples are taken from 32 normal people, 8 patients infected with HBV, 29 liver cancer patients, 4 lung cancer patients, 5 breast cancer patients, and a number of patients with tumors in organs without a large blood flow. The embodiments also generated WGBS data of plasma samples collected from 8 cancer patients (5 early-stage lung cancer patients, 1 late-stage lung cancer patient, 2 lung cancer patients with unknown stage information) and 1 patient with a benign lung tumor. The embodiments used only the normal, HBV, and breast/liver/lung cancer patients in our study, for a total of 87 plasma samples. Note that these public WGBS data have very low sequencing coverage (~4× on average), while the coverage of our newly generated data for all 9 samples is around 10×.

In some embodiments, blood samples of human subjects are used. The embodiments include blood samples of eight lung cancer patients and one benign lung tumor patient.

Cell-free DNA (cfDNA) isolation and Whole Genome Bisulfite Sequencing (WGBS) are included in the embodiments disclosed herein. In some embodiments, cfDNA isolation and WGBS are performed in the following manner: blood samples were centrifuged first at 1,600×g for 10 minutes, and then the plasma was transferred into new micro tubes and centrifuged at 16,000×g for another 10 minutes. The plasma was collected and stored at −80° C. The cfDNA was extracted from 5 ml of plasma using a reagent kit, e.g., the Qiagen QIAamp Circulating Nucleic Acids Kit, and quantified by a fluorometer, e.g., the Qubit 3.0 Fluorometer (Thermo Fisher Scientific). Bisulfite conversion of cfDNA was performed by using a reagent kit, e.g., the EZ-DNA-Methylation-GOLD kit (Zymo Research). After that, another reagent kit, e.g., the Accel-NGS Methy-Seq DNA library kit (Swift Bioscience), was used to prepare the sequencing libraries. The DNA libraries were then sequenced with 150 bp paired-end reads.

The embodiments build and recognize selected CpG features, i.e., CpG clusters. In some embodiments, the CpG features are built in the following manner. Appropriate microarray data was collected, e.g., Infinium HumanMethylation450 microarray data from TCGA measuring solid tumor samples for about 450,000 CpGs. Since the testing sample of the embodiments can be WGBS data with very low sequencing coverage, the embodiments group the CpG sites into CpG clusters in order to use more mappable reads. For a CpG site covered by a probe on the microarray, the embodiment may define the region 100 bp (base pairs) upstream and downstream as its flanking region, and assume that all CpG sites located within this region have the same average methylation level as the CpG sites covered by probes. Two adjacent CpG sites are grouped into one CpG cluster if their flanking regions overlap. Finally, only those CpG clusters containing at least 3 CpG sites covered by microarray probes are used in the embodiments. The embodiments choose the size of the flanking region and the number of CpGs in a cluster according to three criteria: (i) at least three CpG sites (in the microarray data) are included to obtain a robust measurement of methylation values in the solid tumor samples; (ii) the cluster is reasonably sized, so that there are sufficient CpG sites to calculate the methylation values, even when low-coverage sequencing data is used; and (iii) as many clusters as possible are kept that lie within a single type of genomic region (either CpG islands or shores). In some embodiments, this procedure yields 42,374 CpG clusters, which together include about one half of all the CpG sites on the Infinium HumanMethylation450 microarray. Most of those clusters are each associated with only one gene. These CpG clusters are used for subsequent feature selection.
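By way of illustration only, the grouping rule described above can be sketched as follows. The function name `build_cpg_clusters`, the single-chromosome input, and the convention that flanking regions that merely touch count as overlapping are assumptions made for this sketch and are not taken from the disclosure.

```python
from typing import List

FLANK_BP = 100   # flanking region on each side of a probe CpG site (from the text)
MIN_PROBES = 3   # minimum number of probe-covered CpG sites per retained cluster


def build_cpg_clusters(positions: List[int],
                       flank: int = FLANK_BP,
                       min_probes: int = MIN_PROBES) -> List[List[int]]:
    """Group probe CpG positions on one chromosome into CpG clusters."""
    clusters: List[List[int]] = []
    current: List[int] = []
    for pos in sorted(positions):
        # Flanking regions [p - flank, p + flank] of two adjacent sites overlap
        # (assumption: touching counts as overlapping) when the sites are
        # at most 2 * flank apart.
        if current and pos - current[-1] <= 2 * flank:
            current.append(pos)
        else:
            if len(current) >= min_probes:
                clusters.append(current)
            current = [pos]
    if len(current) >= min_probes:
        clusters.append(current)
    return clusters


# Hypothetical probe coordinates: only the first three sites form a cluster.
example = build_cpg_clusters([100, 250, 400, 1000, 1100, 5000])
print(example)
```

In this sketch, a run of adjacent sites is closed as soon as the next site lies more than 200 bp away, and runs with fewer than three probe-covered sites are discarded.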

In some embodiments, methylation microarray data are processed. Methylation microarray data may be processed in the following manner. The microarray data (level 3 in the TCGA database) provide the methylation levels of individual CpG sites. The embodiments define the methylation level of a CpG cluster as the average methylation level of all CpG sites in the cluster. A cluster's methylation level is marked as Not Available (NA) if more than half of its CpG sites do not have methylation measurements.

In some embodiments, WGBS data are processed. WGBS data may be processed in the following manner. A DNA sequence alignment tool, e.g., Bismark, can be employed to align the reads to the reference genome hg19 and call the methylated cytosines. After the removal of PCR duplicates, the numbers of methylated and unmethylated cytosines are counted for each CpG site. The methylation level (x_(k)) of a specific CpG cluster (k) is calculated as the ratio between the number of methylated cytosines (m_(k)) and the total number of cytosines (n_(k)) within the cluster. However, if the total number of cytosines (n_(k)) in the reads aligned to the CpG cluster is less than 30, the methylation level of this cluster is treated as “NA”.
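The two cluster-level calculations described above (for microarray data and for WGBS data) can be sketched as follows. This is a minimal illustration: the function names and the use of `None` to stand for “NA” are assumptions of the sketch, and the alignment and methylation calling performed by a tool such as Bismark are not reproduced here.

```python
from typing import List, Optional

MIN_CYTOSINES = 30  # minimum n_k for a WGBS cluster level, from the text


def microarray_cluster_level(site_levels: List[Optional[float]]) -> Optional[float]:
    """Average the measured site levels; None (NA) if more than half are missing."""
    measured = [v for v in site_levels if v is not None]
    if not site_levels or len(site_levels) - len(measured) > len(site_levels) / 2:
        return None
    return sum(measured) / len(measured)


def wgbs_cluster_level(methylated: int, total: int) -> Optional[float]:
    """Return x_k = m_k / n_k, or None (NA) when n_k is below the threshold."""
    if total < MIN_CYTOSINES:
        return None
    return methylated / total


print(microarray_cluster_level([0.2, 0.4, None]))  # one of three sites missing
print(wgbs_cluster_level(12, 40))                  # enough cytosines
print(wgbs_cluster_level(10, 29))                  # too few cytosines -> None
```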

In some embodiments, the CpG clusters are further filtered. Feature filtering can be done in the following manner. For each CpG cluster, the embodiments may use the methylation range (MR) to indicate a feature's differential power between classes. The embodiments first obtain the average methylation level of all samples from each class (i.e., healthy plasma, or each tumor type), then define MR as the range of this set of mean values (i.e., the difference between the largest and smallest mean values). The higher the MR of a cluster, the more differential power it has. Finally, those CpG clusters whose MRs are no less than a threshold are selected.
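A minimal sketch of the MR-based filtering is given below. The data layout (a dictionary mapping each class name to a list of per-sample profiles), the handling of clusters that are NA in an entire class, and the threshold value used in the example are assumptions of this sketch; the disclosure does not fix a threshold in this passage.

```python
from typing import Dict, List, Optional

Profile = List[Optional[float]]  # one methylation level (or None for NA) per cluster


def class_mean(samples: List[Profile], k: int) -> Optional[float]:
    """Mean level of cluster k over the samples of one class, ignoring NA."""
    values = [s[k] for s in samples if s[k] is not None]
    return sum(values) / len(values) if values else None


def select_by_methylation_range(data: Dict[str, List[Profile]],
                                threshold: float) -> List[int]:
    """Return indices of clusters whose MR (max - min of class means) >= threshold."""
    num_clusters = len(next(iter(data.values()))[0])
    selected = []
    for k in range(num_clusters):
        means = [class_mean(samples, k) for samples in data.values()]
        if any(m is None for m in means):
            continue  # assumption: skip clusters lacking data in any class
        if max(means) - min(means) >= threshold:
            selected.append(k)
    return selected


# Hypothetical toy data: two clusters, three classes.
toy = {
    "normal_plasma": [[0.9, 0.50], [0.8, 0.50]],
    "LIHC": [[0.2, 0.55]],
    "LUAD": [[0.6, 0.45]],
}
print(select_by_methylation_range(toy, threshold=0.25))
```

In the toy data, the first cluster has class means spread over a wide range and is retained, whereas the second cluster has nearly identical class means and is dropped.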

Some embodiments generate simulation data, e.g., from the combination of partitions 610 and 620 shown in FIG. 17. The simulation data are computationally generated from normal plasma and tumor samples for verifying the accuracy of the prediction model L. The embodiments simulate the methylation sequencing data of a patient's plasma cfDNAs using the previously described probabilistic models: (i) a mixture model that treats the cfDNA as a mixture of normal plasma cfDNA and DNAs released from primary tumor sites and (ii) a binomial model for the methylated cytosine count of plasma cfDNA sequencing data. In addition, to make the simulation data more realistic, the embodiments incorporate copy number aberrations and read depth bias. The procedure for generating simulated plasma cfDNA methylation sequencing data is detailed below.

Generating simulation data may need the following inputs: (i) the genomic regions of all K CpG clusters, (ii) the total number of cytosines (Z) on the sequencing reads that are aligned to any CpG cluster, (iii) the range of θ: (θ_(L), θ_(U)), (iv) the collections of normal plasma samples (denoted as POOL_(normal)) and solid tumor samples (denoted as POOL_(tumor)), and (v) b_(k), the background probability for a CpG dinucleotide to be aligned to CpG cluster k, satisfying Σ_(k=1)^(K)b_(k)=1. The last input reflects the read depth bias introduced during the sequencing process and read alignment, and the density of CpG sites in the clusters.

A simulated methylation sequencing profile of a plasma sample is generated, represented by the integer vectors M=(m₁, m₂, . . . , m_(K)) and N=(n₁, n₂, . . . , n_(K)). The elements m_(k) and n_(k) are the number of methylated cytosines and the total number of cytosines in the reads mapped to CpG cluster k, respectively.

In one embodiment, the simulated methylation sequencing profile of a plasma sample is generated with the following procedure, which includes six steps. Step 1: Generate a random ctDNA fraction θ from the distribution θ~Uniform(θ_(L), θ_(U)).

Step 2: Generate a random integer copy number c_(k), for each CpG cluster k, from the categorical distribution c_(k)~Cat(6, p₀, p₁, p₂, p₃, p₄, p₅), where Cat refers to a categorical distribution. Here p_(c) denotes the probability of observing copy number c∈{0, 1, 2, 3, 4, 5} in the sequencing data. The probabilities p_(c) satisfy three criteria: (i) their sum is equal to one, Σ_(c=0)^(5)p_(c)=1; (ii) the average copy number is equal to two, Σ_(c=0)^(5)c·p_(c)=2; and (iii) extreme copy number alterations are less likely to occur. In some cases, the embodiments may predefine p₀=0.005, p₁=0.16, p₂=0.7, p₃=0.105, p₄=0.025, p₅=0.005. Note that the sum of all these probabilities except p₂ (30% in this case) is the probability of any given CpG cluster having a CNA event. Other probability configurations, with more (50%) or fewer (10%) CNA events, were also simulated and yielded similar results. No CNA event is considered (i.e., c_(k) is fixed to two) when simulating a normal plasma sample.
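As a check on the example configuration, criteria (i) and (ii) and the stated 30% CNA probability follow from simple arithmetic on the predefined values:

$\sum_{c=0}^{5} p_{c} = 0.005 + 0.16 + 0.7 + 0.105 + 0.025 + 0.005 = 1,$

$\sum_{c=0}^{5} c\,p_{c} = 0(0.005) + 1(0.16) + 2(0.7) + 3(0.105) + 4(0.025) + 5(0.005) = 0.16 + 1.4 + 0.315 + 0.1 + 0.025 = 2,$

$1 - p_{2} = 1 - 0.7 = 0.3.$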

Step 3: Randomly select a normal plasma sample from POOL_(normal) whose methylation profile is denoted by (v₁, v₂, . . . , v_(K)), and randomly select a solid tumor from POOL_(tumor) whose methylation level profile is denoted by (u₁, u₂, . . . , u_(K)). Note that some embodiments may also randomly select two normal plasma samples from POOL_(normal), in order to simulate a new normal plasma sample.

Step 4: Calculate the methylation level x_(k) of plasma cfDNA at CpG cluster k. This is the adjusted linear combination of v_(k) and u_(k) after incorporating the copy number c_(k) generated in Step 2. That is, x_(k)=(1−θ_(k)′)v_(k)+θ_(k)′u_(k), where θ_(k)′ is the adjusted value of θ given by

$\theta_{k}^{\prime} = \frac{\theta c_{k}}{\theta c_{k} + 2\left(1 - \theta\right)}.$

Here θ_(k)′ describes the actual ctDNA fraction after considering the copy number c_(k) of the ctDNA.

Step 5: Generate a random number n_(k), representing the total number of cytosines in CpG cluster k, from the Poisson distribution n_(k)~Poisson(ZB_(k)). B_(k) is the adjusted CpG dinucleotide bias b_(k), given by

$B_{k} = \frac{b_{k}\left(1 - \theta + \theta c_{k}/2\right)}{\sum_{k^{\prime} = 1}^{K} b_{k^{\prime}}\left(1 - \theta + \theta c_{k^{\prime}}/2\right)},$

after scaling with the copy number c_(k) generated in Step 2.

Step 6: Generate a random number m_(k) from the binomial distribution m_(k)~Binomial(n_(k), x_(k)).
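The six steps can be sketched end to end as follows, using only the Python standard library. This is an illustrative sketch under stated assumptions, not the implementation of the embodiments: the function names, the toy pools, the value of Z, and the simple Poisson and binomial samplers (including the normal approximation used for large Poisson means) are all choices made for this example. The sketch also assumes θ<1, so that the normalizing sum in Step 5 is positive.

```python
import math
import random
from typing import List, Sequence, Tuple

# p_0 .. p_5 as predefined in the text for copy numbers 0..5
CN_PROBS = [0.005, 0.16, 0.7, 0.105, 0.025, 0.005]


def adjusted_fraction(theta: float, c: int) -> float:
    """theta'_k = theta * c_k / (theta * c_k + 2 * (1 - theta))."""
    denom = theta * c + 2.0 * (1.0 - theta)
    return theta * c / denom if denom > 0 else 0.0


def sample_poisson(lam: float, rng: random.Random) -> int:
    """Poisson draw using only the standard library (sketch-quality sampler)."""
    if lam > 500:  # assumption: normal approximation for large means
        return max(0, round(rng.gauss(lam, math.sqrt(lam))))
    limit, count, prod = math.exp(-lam), 0, rng.random()
    while prod > limit:  # Knuth's multiplication method
        count += 1
        prod *= rng.random()
    return count


def simulate_plasma_profile(pool_normal: Sequence[Sequence[float]],
                            pool_tumor: Sequence[Sequence[float]],
                            b: Sequence[float], Z: int,
                            theta_range: Tuple[float, float],
                            rng: random.Random) -> Tuple[List[int], List[int], float]:
    K = len(b)
    theta = rng.uniform(*theta_range)                        # Step 1
    c = rng.choices(range(6), weights=CN_PROBS, k=K)         # Step 2
    v, u = rng.choice(pool_normal), rng.choice(pool_tumor)   # Step 3
    x = []                                                   # Step 4
    for k in range(K):
        t = adjusted_fraction(theta, c[k])
        x.append((1.0 - t) * v[k] + t * u[k])
    scale = [b[k] * (1.0 - theta + theta * c[k] / 2.0) for k in range(K)]
    total = sum(scale)
    n = [sample_poisson(Z * s / total, rng) for s in scale]  # Step 5
    # Step 6: Binomial(n_k, x_k) drawn as a sum of Bernoulli trials
    m = [sum(rng.random() < x[k] for _ in range(n[k])) for k in range(K)]
    return m, n, theta


# Hypothetical toy inputs with K = 4 clusters.
rng = random.Random(7)
normals = [[0.9, 0.8, 0.1, 0.5]]
tumors = [[0.1, 0.2, 0.9, 0.5]]
M, N, theta = simulate_plasma_profile(normals, tumors, [0.25] * 4, 400, (0.0, 0.5), rng)
print(M, N, round(theta, 3))
```

A normal plasma sample could be simulated with the same sketch by replacing Step 2 with a fixed copy number of two for every cluster and drawing both profiles from the normal pool.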

Some embodiments also simulate new normal plasma samples by mixing two normal plasma samples at different mixture ratios. The procedure is the same as above except that Step 2 is skipped by fixing all copy numbers at two, because there are no CNA events in the normal plasma samples.

Some embodiments use the following method of data partitioning for learning signature features, simulation experiments, and real data experiments. All the TCGA solid tumor tissues and plasma samples are divided into non-overlapping sets for three tasks: (i) learning discriminating features, (ii) simulation experiments, and (iii) testing on the real data. Specifically, as shown in FIG. 17, the embodiments split the TCGA solid tumors of each tissue type into two partitions: 75% 605 for learning signature features, and 25% 610 for generating simulation data. The embodiments also split all normal plasma samples into two partitions: 75% 615 for learning signature features, and 25% 620 for generating simulation data or for real data experiments. All the plasma samples of the cancer patients 625 are used to form the testing set in the real data experiments. Note that the plasma samples of cancer patients were not used for learning features; only solid tumor samples collected from public methylation databases, and a subset of normal plasma samples that were not used for testing, were used for that purpose. All data are randomly partitioned following the above proportions, and applying a method on one such partition is regarded as “one run”. To obtain robust results, the embodiments repeat the experiments for ten runs, and aggregate all predictions obtained in the ten runs into a single confusion matrix as the final result. Because only a limited number of real cancer plasma samples are available for testing (only 5, 12, and 29 cfDNA samples from breast, lung, and liver cancer patients, respectively), typical cross-validation for the method's hyperparameter estimation is not feasible.
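The random 75%/25% partitioning repeated over ten runs can be sketched as follows. The sample identifiers, the seeding scheme, and the rounding of the 75% cut point are assumptions of this sketch; the prediction step and the aggregation into a confusion matrix are not shown.

```python
import random
from typing import Dict, List, Sequence, Tuple


def split_75_25(samples: Sequence[str], rng: random.Random) -> Tuple[List[str], List[str]]:
    """Randomly split samples into a 75% partition and a disjoint 25% partition."""
    shuffled = list(samples)
    rng.shuffle(shuffled)
    cut = round(0.75 * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


def make_runs(tumors_by_type: Dict[str, Sequence[str]],
              normal_plasma: Sequence[str],
              n_runs: int = 10, seed: int = 0) -> List[dict]:
    """Build one independent random partition per run."""
    runs = []
    for run in range(n_runs):
        rng = random.Random(seed + run)  # assumption: one seed per run
        tumor_learn, tumor_sim = {}, {}
        for tissue, samples in tumors_by_type.items():
            tumor_learn[tissue], tumor_sim[tissue] = split_75_25(samples, rng)
        normal_learn, normal_rest = split_75_25(normal_plasma, rng)
        runs.append({
            "tumor_learn": tumor_learn,    # partition 605
            "tumor_sim": tumor_sim,        # partition 610
            "normal_learn": normal_learn,  # partition 615
            "normal_rest": normal_rest,    # partition 620
        })
    return runs


# Hypothetical sample identifiers.
runs = make_runs({"LIHC": [f"t{i}" for i in range(8)]}, [f"n{i}" for i in range(4)])
print(len(runs), runs[0]["tumor_sim"]["LIHC"])
```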

Various methods, steps, and calculations of parameters for cancer detection and tissue-of-origin identification disclosed herein can be implemented in a computer system 700 as shown in FIG. 18 and/or the computer system 800 shown in FIG. 19. For example, the flow chart of the method 100 shown in FIG. 12 can be implemented in the computer system 700 and/or the computer system 800. In another example, as shown in FIG. 13, the equation ψ(x_(k)|θ, t) and the parameters involved, x, θ, u, v, k, and t, can be implemented as computer-readable instructions on computer system 700 and/or computer system 800. In another example, the cancer detection results shown in FIGS. 14A-16 can be generated by computer system 700 and/or computer system 800. In yet another example, the data partitions for learning signature features, simulation experiments, and real data experiments can be executed by computer system 700 and/or computer system 800. In yet another example, the confusion matrix shown in Table 1 can be computed by computer system 700 and/or computer system 800.

FIG. 18 illustrates a computer system 700 for obtaining access to database files for detecting cancer and identifying tissue-of-origin according to one embodiment of the disclosure. The computer system 700 may include a server 702, a data storage device 706, a network 708, and a user interface device 710. The server 702 may also be a hypervisor-based system executing one or more guest partitions hosting operating systems with modules having server configuration information. In a further embodiment, the system 700 may include a storage controller 704, or a storage server configured to manage data communications between the data storage device 706 and the server 702 or other components in communication with the network 708. In an alternative embodiment, the storage controller 704 may be coupled to the network 708.

In one embodiment, the user interface device 710 is referred to broadly and is intended to encompass a suitable processor-based device such as a desktop computer, a laptop computer, a personal digital assistant (PDA) or tablet computer, a smartphone or other mobile communication device having access to the network 708. In a further embodiment, the user interface device 710 may access the Internet or other wide area or local area network to access a web application or web service hosted by the server 702 and may provide a user interface for enabling a user to enter or receive information.

The network 708 may facilitate communications of data between the server 702 and the user interface device 710. The network 708 may include any type of communications network including, but not limited to, a direct PC-to-PC connection, a local area network (LAN), a wide area network (WAN), a modem-to-modem connection, the Internet, a combination of the above, or any other communications network now known or later developed within the networking arts which permits two or more computers to communicate.

FIG. 19 illustrates a computer system 800 configured for cancer detection and tissue-of-origin identification according to one embodiment of the disclosure. FIG. 19 also illustrates a computer system 800 adapted according to certain embodiments of the server 702 and/or the user interface device 710. The central processing unit (“CPU”) 802 is coupled to the system bus 804. The CPU 802 may be a general purpose CPU or microprocessor, graphics processing unit (“GPU”), and/or microcontroller. The present embodiments are not restricted by the architecture of the CPU 802 so long as the CPU 802, whether directly or indirectly, supports the operations as described herein. The CPU 802 may execute the various logical instructions according to the present embodiments.

The computer system 800 may also include random access memory (RAM) 808, which may be static RAM (SRAM), dynamic RAM (DRAM), synchronous dynamic RAM (SDRAM), or the like. The computer system 800 may utilize RAM 808 to store the various data structures used by a software application. The computer system 800 may also include read only memory (ROM) 806 which may be PROM, EPROM, EEPROM, optical storage, or the like. The ROM may store configuration information for booting the computer system 800. The RAM 808 and the ROM 806 hold user and system data, and both the RAM 808 and the ROM 806 may be randomly accessed.

The computer system 800 may also include an I/O adapter 810, a communications adapter 814, a user interface adapter 816, and a display adapter 822. The I/O adapter 810 and/or the user interface adapter 816 may, in certain embodiments, enable a user to interact with the computer system 800. In a further embodiment, the display adapter 822 may display a graphical user interface (GUI) associated with a software or web-based application on a display device 824, such as a monitor or touch screen.

The I/O adapter 810 may couple one or more storage devices 812, such as one or more of a hard drive, a solid state storage device, a flash drive, a compact disc (CD) drive, a floppy disk drive, and a tape drive, to the computer system 800. According to one embodiment, the data storage 812 may be a separate server coupled to the computer system 800 through a network connection to the I/O adapter 810. The communications adapter 814 may be adapted to couple the computer system 800 to the network 708, which may be one or more of a LAN, WAN, and/or the Internet. The user interface adapter 816 couples user input devices, such as a keyboard 820, a pointing device 818, and/or a touch screen (not shown) to the computer system 800. The display adapter 822 may be driven by the CPU 802 to control the display on the display device 824. Any of the devices 802-822 may be physical and/or logical.

The cancer detection models of the present disclosure are not limited to the architecture of computer system 800. Rather, the computer system 800 is provided as an example of one type of computing device that may be adapted to perform the functions of the server 702 and/or the user interface device 710. For example, any suitable processor-based device may be utilized including, without limitation, personal data assistants (PDAs), tablet computers, smartphones, computer game consoles, and multi-processor servers to implement various embodiments and/or steps of the cancer detection models disclosed herein. Moreover, various embodiments of the cancer detection methods of the present disclosure may be implemented on application specific integrated circuits (ASIC), very large scale integrated (VLSI) circuits, or other circuitry. In fact, persons of ordinary skill in the art may utilize any number of suitable structures capable of executing logical operations according to the described embodiments. For example, the computer systems 700 and 800 may be virtualized for access by multiple users and/or applications.

If the various methods, steps, and calculations of parameters disclosed herein are implemented in firmware and/or software, the functions described above may be stored as one or more instructions or code on a computer-readable medium. Examples include non-transitory computer-readable media encoded with a data structure and computer-readable media encoded with a computer program. Computer-readable media includes physical computer storage media. A storage medium may be any available medium that can be accessed by a computer. By way of example, and not limitation, such computer-readable media can comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store desired program code in the form of instructions or data structures and that can be accessed by a computer. Disk and disc include compact discs (CD), laser discs, optical discs, digital versatile discs (DVD), floppy disks and Blu-ray discs. Generally, disks reproduce data magnetically, and discs reproduce data optically. Combinations of the above should also be included within the scope of computer-readable media.

In addition to storage on a computer-readable medium, instructions and/or data may be provided as signals on transmission media included in a communication apparatus. For example, a communication apparatus may include a transceiver having signals indicative of instructions and data. The instructions and data are configured to cause one or more processors to implement the functions outlined in the claims.

Further Embodiments Related to CancerDetector

Several reasons motivate methylation-based tumor cfDNA detection: (i) DNA methylation patterns are pervasive, meaning that the same methylation patterns (methylated or unmethylated) tend to spread throughout a genome region. This feature has been employed by one group to evaluate DNA hypomethylation across large genome regions for cancer diagnosis (Chan et al. (2013)). As disclosed herein, this feature is used to amplify aberrant cfDNA signals, but at the resolution of single sequencing reads, therefore providing an ultra-sensitive detection of a tiny amount of tumor cfDNA even at a low sequencing coverage. (ii) Aberrant DNA methylation patterns occur early in the pathogenesis of cancer (Baylin et al. (2001)), therefore facilitating early cancer detection. In fact, DNA methylation abnormalities are one of the hallmarks of cancer and are associated with all aspects of cancer, from tumor initiation to cancer progression and metastasis (Cheishvili et al. (2015), Roy et al. (2014), Plass et al. (2013)), and have become attractive targets for cancer epigenetic therapy (Smith et al. (2007), Sigalotti et al. (2007)).

A key aspect involves focusing on the joint methylation patterns of multiple adjacent CpG sites on an individual cfDNA sequencing read, in order to exploit the pervasive nature of DNA methylation for signal amplification. Traditional DNA methylation analysis focuses on the methylation rate of an individual CpG site in a cell population. This rate, often called the β-value, is the proportion of cells in which the CpG site is methylated (see an example in FIG. 1). However, such population-average measures are not sensitive enough to capture an abnormal methylation signal affecting only a small proportion of the cfDNAs. FIG. 20 illustrates this point: the average methylation rates of the individual CpG sites are β_(normal)=1 for normal plasma cfDNAs, and β_(tumor)=0 for tumor cfDNAs; assuming the presence of 1% tumor cfDNAs, the traditional measure yields β_(mixed)=0.99, which is hard to differentiate from β_(normal)=1. However, based on the pervasive nature of DNA methylation, a new way to differentiate disease-specific cfDNA reads from normal cfDNA reads was investigated. When the methylation values of all CpG sites in a given read are averaged (the result being denoted the α-value), there is a striking difference (0 versus 1) between the abnormally methylated cfDNAs and the normal cfDNAs (α_(tumor)=0% and α_(normal)=100%). In other words, given the pervasive nature of DNA methylation, the joint methylation patterns of multiple adjacent CpG sites can easily distinguish cancer-specific cfDNA reads from normal cfDNA reads. Inspired by the α-value, it was realized that the key to exploiting pervasive methylation is to estimate whether the joint probability of all CpG sites in a read follows the DNA methylation signature of a disease. This read-based probabilistic approach is termed “CancerDetector”; it can sensitively identify a trace amount of tumor cfDNAs out of all cfDNAs in plasma.
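The contrast between the per-site β-value and the per-read α-value can be reproduced with a small numerical sketch. The toy data below (99 fully methylated normal reads and 1 fully unmethylated tumor read over four CpG sites, all reads covering the same sites) are an assumption chosen to mirror the 1% example above.

```python
from typing import List


def beta_values(reads: List[List[int]]) -> List[float]:
    """Per-site methylation rate: fraction of reads methylated at each CpG site."""
    n_sites = len(reads[0])
    return [sum(r[j] for r in reads) / len(reads) for j in range(n_sites)]


def alpha_values(reads: List[List[int]]) -> List[float]:
    """Per-read methylation rate: fraction of methylated CpG sites on each read."""
    return [sum(r) / len(r) for r in reads]


normal_reads = [[1, 1, 1, 1] for _ in range(99)]  # fully methylated
tumor_reads = [[0, 0, 0, 0]]                      # fully unmethylated
mixed = normal_reads + tumor_reads

print(beta_values(mixed))           # every site is close to 1
print(sorted(alpha_values(mixed))[:2])  # the tumor read stands out at 0
```

Every per-site value in the mixture stays near 1, whereas the single tumor read is cleanly separated from the normal reads by its per-read value.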

Some CancerDetector embodiments were evaluated on simulated plasma samples that subsample and combine sequencing reads of a normal plasma cfDNA sample and a solid tumor sample at known mixing rates (or tumor burdens). The results showed that CancerDetector can achieve a Pearson's correlation coefficient (PCC) of 0.9974 (P-value 9.8E-8) between the predicted and true proportions of tumor cfDNAs at medium sequencing coverage (10×). The prediction performance also increases with the sequencing coverage: the higher the sequencing coverage, the closer the predicted tumor burden is to the true value. Moreover, CancerDetector outperformed our previous method of cfDNA tumor burden prediction, i.e., “CancerLocator” (Kang et al. (2017)), in terms of both prediction performance and robustness. We then tested CancerDetector on real plasma cfDNA samples and demonstrated its high performance across 10 experimental runs, i.e., a sensitivity of 94.4%±3.7% (when specificity is 100%) for early-stage cancer patients, while CancerLocator has a sensitivity of 74.4%±10.0% (when specificity is 100%). In addition, the tumor burden predicted by CancerDetector showed great consistency with clinical information, such as tumor size and survival outcome, in longitudinal samples. Note that we achieved these results based on real samples that have low sequencing coverage (1× to 3×, averaged across all genome positions).

FIG. 21 illustrates a method 200 for detecting a cancer, according to one embodiment. The method 200 includes a step 110 of identifying DNA methylation markers. The method 200 includes a step 120 of sequencing a cfDNA methylation profile of a patient. The method 200 includes a step 130 of inferring cfDNA composition using a read-based probabilistic model.

At the step 110, DNA methylation markers are identified, for example, marker-1 112, marker-2 114, marker-3 116, . . . and marker-K 118. Methylation markers 112, 114, 116, 118 are methylation patterns specific to a certain type of cancer, e.g., liver cancer.

To identify DNA methylation markers, a number J of CpG clusters are identified in a cfDNA methylation profile. Using liver cancer as an example, among all J CpG clusters, a set of CpG clusters is chosen to be markers, whose methylation levels can differentiate most liver tumors from both normal liver cells and normal plasma. This task may further include two steps: (1) Selecting those “frequently differential methylation regions (FDMR),” in which the methylation levels are differential (the difference being greater than a cutoff) between matched liver tumor and normal liver tissue in more than half of the matched pairs. This step can remove markers specific to healthy liver tissues and retain markers specific to liver cancer. (2) Selecting those FDMRs that can distinguish tumor samples from normal plasma samples. This can be done by selecting the FDMRs for which the difference between the medians of the methylation levels in the normal class (N-class) and the tumor class (T-class) is greater than a predetermined threshold. This step ensures that the tumor methylation signal can be identified in blood. Given a fixed sequencing coverage of cfDNAs, the more markers are used, the lower the quality these markers may have, but the more tumor-derived cfDNA reads may be identified. Therefore, there is a tradeoff between the markers' quality and the amount of tumor cfDNA signal that can be used.

In one embodiment, the above-mentioned predetermined thresholds for the two steps mentioned in the previous paragraph can be 0.1 to 0.9, e.g., 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, and 0.9. In one embodiment, the predetermined threshold can be 0.1 to 0.3. In yet another embodiment, the threshold is 0.2.

Each marker 112, 114, 116, 118 corresponds to a marker region 122, 124, 126, and 128 in the cfDNA methylation profile of the patient. Each marker region 122, 124, 126, and 128 has two methylation patterns: one normal class (N-class) pattern and one tumor class (T-class) pattern. The methylation pattern in any class is modeled as a Beta distribution Beta(η, ρ). Specifically, a marker k is associated with two methylation patterns, i.e., Beta(η_(k)^(T), ρ_(k)^(T)) for the class T and Beta(η_(k)^(N), ρ_(k)^(N)) for the class N. Note that η_(k)^(c) and ρ_(k)^(c) are the two shape parameters (usually denoted α and β) of a Beta distribution, but here we use the symbols η and ρ to avoid confusion with the α-value and β-value defined above. The parameters of a Beta distribution can be easily learnt from the sample population of a class, using either the method of moments or maximum likelihood. To simplify notation, we denote the methylation pattern of marker k for class T as m_(k)^(T)≡Beta(η_(k)^(T), ρ_(k)^(T)), and for class N as m_(k)^(N)≡Beta(η_(k)^(N), ρ_(k)^(N)).
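For illustration, the method-of-moments option mentioned above can be sketched as follows, using the standard moment-matching formulas for a Beta distribution. The function names and the example methylation levels are hypothetical, and the use of the sample variance from the `statistics` module is an assumption of this sketch.

```python
import statistics
from typing import Sequence, Tuple


def beta_from_moments(mean: float, var: float) -> Tuple[float, float]:
    """Return (eta, rho) of the Beta distribution with the given mean and variance."""
    if not 0.0 < mean < 1.0 or not 0.0 < var < mean * (1.0 - mean):
        raise ValueError("moments are not attainable by a Beta distribution")
    common = mean * (1.0 - mean) / var - 1.0
    return mean * common, (1.0 - mean) * common


def fit_beta(levels: Sequence[float]) -> Tuple[float, float]:
    """Fit Beta(eta, rho) to the methylation levels of one marker in one class."""
    return beta_from_moments(statistics.mean(levels), statistics.variance(levels))


# Hypothetical methylation levels of one marker across four tumor samples.
eta, rho = fit_beta([0.1, 0.2, 0.3, 0.4])
print(round(eta, 4), round(rho, 4))
```

The fitted distribution has mean η/(η+ρ) equal to the sample mean, so a marker that is mostly unmethylated in a class yields η smaller than ρ.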

The method 200 further includes calculating the class-specific likelihood of each cfDNA sequencing read 142. The goal is to classify each cfDNA read as class T or N, based on the joint-methylation-status of multiple CpG sites on the read. The joint-methylation-status 134 in a cfDNA read 132 is denoted as r=(r₁, r₂, . . . ), where the binary value r_(j)=1 or 0 represents the methylated or unmethylated status of the CpG site j in read r. This binary vector r is modeled by the Beta-Bernoulli distribution.

Specifically, as shown in FIG. 22, given a methylation pattern m≡Beta(η, ρ) of the marker into which read r falls, the methylation status r_(j) of each CpG site j in the read is distributed as r_(j)~Bernoulli(p), where p is the average methylation rate of CpG sites within the marker and follows the Beta prior distribution p~Beta(η, ρ). Using this statistical model, the likelihood of the joint methylation status in read r=(r₁, r₂, . . . ) 330, given the methylation pattern m 310 or 320, can be calculated as below:

$\begin{aligned} P\left(r \mid m\right) &= \prod_{j} P\left(r_{j} \mid \mathrm{Beta}\left(\eta,\rho\right)\right) \\ &= \prod_{j} \int_{0}^{1} \mathrm{Bernoulli}\left(r_{j} \mid p\right)\,\mathrm{Beta}\left(p \mid \eta,\rho\right)\,dp \\ &= \prod_{j} \int_{0}^{1} p^{r_{j}}\left(1 - p\right)^{1 - r_{j}}\,\frac{p^{\eta - 1}\left(1 - p\right)^{\rho - 1}}{B\left(\eta,\rho\right)}\,dp \\ &= \prod_{j} \frac{B\left(r_{j} + \eta,\; 1 - r_{j} + \rho\right)}{B\left(\eta,\rho\right)} \end{aligned}$

where B(x, y) is the beta function. Therefore, for marker k with methylation pattern m_(k)^(T) of class T and m_(k)^(N) of class N, we can use the above formula to compute the class-specific likelihoods of read r, i.e., P(r|m_(k)^(T)) 322 and P(r|m_(k)^(N)) 312. Note that this likelihood calculation implements a probabilistic version of the α-value for individual reads.
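The closed-form likelihood above can be evaluated numerically as sketched below. The function names and the example shape parameters are illustrative assumptions; the computation works in log space with `math.lgamma` so that long reads do not underflow before the final exponentiation.

```python
import math
from typing import Sequence


def log_beta(a: float, b: float) -> float:
    """Logarithm of the beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def read_likelihood(read: Sequence[int], eta: float, rho: float) -> float:
    """P(r | Beta(eta, rho)) = prod_j B(r_j + eta, 1 - r_j + rho) / B(eta, rho)."""
    base = log_beta(eta, rho)
    log_lik = sum(log_beta(rj + eta, 1 - rj + rho) - base for rj in read)
    return math.exp(log_lik)


# Hypothetical patterns: the tumor class is mostly unmethylated, the normal
# class mostly methylated, at this marker.
unmethylated_read = [0, 0, 0, 0]
print(read_likelihood(unmethylated_read, eta=1.0, rho=9.0))  # tumor pattern
print(read_likelihood(unmethylated_read, eta=9.0, rho=1.0))  # normal pattern
```

Each factor reduces to η/(η+ρ) for a methylated site and ρ/(η+ρ) for an unmethylated site, so a fully unmethylated read is far more likely under the hypothetical tumor pattern than under the normal one.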

Method 200 includes a step 130 of predicting tumor-derived cfDNA burden. As illustrated in FIG. 20, a probabilistic framework was developed to infer the tumor-derived cfDNA fraction (i.e., tumor burden), denoted as 0≤θ<1, by classifying cfDNA reads into two classes (class T for tumor-derived DNAs and class N for normal plasma cfDNAs), based on a set of markers associated with the methylation patterns of the two classes. The methylation patterns of all K markers are denoted as

$\mathcal{M} = \left\{\left(m_{1}^{T}, m_{1}^{N}\right), \ldots, \left(m_{k}^{T}, m_{k}^{N}\right), \ldots, \left(m_{K}^{T}, m_{K}^{N}\right)\right\}.$

The methylation sequencing data of a patient's cfDNAs are denoted as a set of N reads R={r⁽¹⁾, . . . , r^((N))} that in total cover M CpG sites. For a read that is aligned to the region of marker k, we assume that it can come from one of two classes with the class-specific likelihood P(r|m_(k)^(c)), where m_(k)^(c) is the methylation pattern of class c. Let θ be the tumor-derived cfDNA burden, so the fraction of normal cfDNA is 1−θ. It is desirable to estimate θ by maximizing the log-likelihood $\log P\left(R \mid \theta, \mathcal{M}\right)$. This is a maximum likelihood estimation problem. The reads are assumed to be independent, so that

$P\left(R \mid \theta, \mathcal{M}\right) = \prod_{i=1}^{N} P\left(r^{(i)} \mid \theta, \mathcal{M}\right).$

The likelihood $P\left(r^{(i)} \mid \theta, \mathcal{M}\right)$ of a read r^((i)) aligned to marker k is expanded as follows:

$P\left(r^{(i)} \mid \theta, \mathcal{M}\right) = \theta\,P\left(r^{(i)} \mid m_{k}^{T}\right) + \left(1 - \theta\right)P\left(r^{(i)} \mid m_{k}^{N}\right).$

Because $P\left(R \mid \theta, \mathcal{M}\right)$ has only one parameter θ to be estimated, a grid search can be applied to exhaustively enumerate all 1,001 fraction values that are uniformly spaced between 0% and 100%, i.e., 0%, 0.1%, . . . , 99.9%, and 100%. This method reaches the global optimum at a precision of 0.1%, which we think is sufficient for capturing the tiny amount of tumor-derived cfDNAs. Because the grid search is computationally fast, the grid can be readily refined to determine θ at higher resolution.
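The grid search over θ can be sketched as follows. The input format (one pair of class-specific likelihoods per read, already computed for the marker the read falls into), the function name, and the toy likelihood values are assumptions of this sketch.

```python
import math
from typing import List, Tuple


def estimate_tumor_burden(read_likelihoods: List[Tuple[float, float]],
                          steps: int = 1000) -> float:
    """Grid-search MLE of theta from per-read (P(r|m^T), P(r|m^N)) pairs."""
    best_theta, best_ll = 0.0, -math.inf
    for i in range(steps + 1):  # theta = 0, 1/steps, ..., 1
        theta = i / steps
        ll = 0.0
        for p_tumor, p_normal in read_likelihoods:
            mix = theta * p_tumor + (1.0 - theta) * p_normal
            if mix <= 0.0:  # this theta cannot explain the read
                ll = -math.inf
                break
            ll += math.log(mix)
        if ll > best_ll:
            best_theta, best_ll = theta, ll
    return best_theta


# Hypothetical sample: 90 reads that look normal and 10 reads that look tumor-like.
reads = [(0.0001, 0.6561)] * 90 + [(0.6561, 0.0001)] * 10
print(estimate_tumor_burden(reads))
```

With the default of 1,000 steps the estimate is resolved to 0.1%; passing a larger `steps` value refines the grid at proportionally higher cost.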

Having described the invention in detail, it will be apparent that modifications, variations, and equivalent embodiments are possible without departing from the scope of the invention defined in the appended claims. Furthermore, it should be appreciated that all examples in the present disclosure are provided as non-limiting examples.

EXAMPLES

The following non-limiting examples are provided to further illustrate embodiments of the invention disclosed herein. It should be appreciated by those of skill in the art that the techniques disclosed in the examples that follow represent approaches that have been found to function well in the practice of the invention, and thus can be considered to constitute examples of modes for its practice. However, those of skill in the art should, in light of the present disclosure, appreciate that many changes can be made in the specific embodiments that are disclosed and still obtain a like or similar result without departing from the spirit and scope of the invention.

Example 1 Methylation Signatures

Most public methylation data in GEO and TCGA are microarray data, a type of summary data that is best modeled at the bin level. Therefore, in this work, bin-level methylation data, i.e., Model 1 of FIG. 4, are used to characterize the methylation signature of each “cancer-specific” or “tissue-specific” region.

It is noted that the method is flexible and allows methylation data at two different levels (bin level or CpG site level) to be used.

Example 2 Probabilistic Framework of Inferring Plasma cfDNA Compositions with “Phased” Methylation Data

The second step is to infer the plasma cfDNA composition over T≥2 classes. As shown in FIG. 3, for “cancer composition”, the T=2 classes are normal plasma cfDNAs and a specific tumor type, while for “tissue composition”, the T>2 classes are T normal tissue types.

Because cfDNA is present in plasma at low abundance, and in order to take advantage of “phased” data, as shown in FIG. 2, we always perform methylation sequencing on the patient's plasma cfDNAs. As a result, we have a collection of N cfDNA sequencing reads covering M CpG sites in all methylation signatures identified in Component 1 of the method. As shown in FIG. 5, the methylation data of these fragments can be represented by a ternary matrix R=(r_(ij))_(N×M), where each row corresponds to a fragment, each column corresponds to a CpG site, and each entry r_(ij)∈{0, 1, −}. Here r_(ij)=0 (or 1) indicates that the j-th CpG site is covered by the i-th fragment and is unmethylated (or methylated), while r_(ij)=− indicates that site j is not covered by fragment i.
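One way to hold the ternary read-by-site matrix in memory is sketched below. The function name and the input format (one dictionary of site index to methylation call per read) are assumptions for illustration, and `None` stands in for the “−” (not covered) entry.

```python
def build_read_matrix(reads, num_sites):
    """Build an N x M matrix (list of rows) from per-read methylation calls.

    reads: list of dicts mapping a CpG site index j (0-based) to 0 or 1.
    Entries for sites a read does not cover are None.
    """
    matrix = []
    for calls in reads:
        row = [None] * num_sites
        for j, state in calls.items():
            if state not in (0, 1):
                raise ValueError("methylation state must be 0 or 1")
            row[j] = state
        matrix.append(row)
    return matrix
```

For example, two reads covering sites {0, 1} and {2} of a four-site panel give a 2×4 matrix in which five of the eight entries are `None`.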

We assume that a patient's plasma cfDNA is composed of T known classes with methylation signatures Ω^(t) (1≤t≤T) learned in Component 1 of the method, plus a new unknown tissue type consisting of all cfDNAs that are unlikely to belong to any known class, because in real data not all cfDNAs are released from the T known tissues. (For ease of understanding, we use “tissues” instead of “classes” in the rest of this section.) The set of methylation signatures is therefore defined as Ω={Ω¹, Ω², . . . , Ω^(T), Ω^(T+1)}, where Ω^(t) (1≤t≤T) is known and Ω^(T+1) is unknown and will be estimated by our method. Given the input matrix R derived from methylation sequencing data of a patient's plasma cfDNA, we make two assumptions:

Assumption 1: Each cfDNA fragment is released from a tissue cell subpopulation, and the composition of the tissues contributing to the plasma cfDNA is denoted as a composition vector θ=(θ₁, θ₂, . . . , θ_(T), θ_(T+1)) (written Θ in the equations below), where θ_(t) (1≤t≤T+1) is the cfDNA proportion of tissue t in the plasma cfDNA and Σ_(t=1) ^(T+1)θ_(t)=1.

Assumption 2: The proportion of cfDNA fragments from a tissue t in the input matrix R reveals the cfDNA fraction θ_(t) of tissue t in the plasma.

Based on these assumptions, the input fragment matrix R, and the tissue-specific signatures Ω, we aim to maximize the likelihood P(R|θ, Ω) in order to estimate the tissue composition θ in plasma cfDNAs. This is formally expressed as:

$\begin{matrix}{\max\limits_{\Theta}\log P\left(R \middle| \Theta,\Omega\right)} & (1)\end{matrix}$ $\text{s.t.}\ \sum\limits_{t=1}^{T+1}\theta_{t} = 1$

where θ and Ω^(T+1) (in Ω) are the parameters to be estimated. This is a typical maximum likelihood estimation (MLE) problem. The log-likelihood can be expanded as the sum of the log-likelihoods of the individual cfDNA sequencing reads:

$\begin{matrix}{{\log{P\left( {\left. R \middle| \Theta \right.,\Omega} \right)}} = {\sum\limits_{i = 1}^{N}{\log{P\left( {\left. R_{i} \middle| \Theta \right.,\Omega} \right)}}}} & (2)\end{matrix}$

We optimize Eq. (1) with the Expectation-Maximization (EM) algorithm and introduce a latent random variable z_(i) for each cfDNA sequencing read R_(i) (the i-th row of the matrix R) to indicate which tissue t this read is released from, i.e., z_(i)=t with t=1, 2, . . . , T, T+1. We model z_(i) as following the categorical distribution z_(i)~Cat(θ) over the T+1 tissues, and therefore P(z_(i)=t|θ)=θ_(t). We then rewrite the likelihood of a cfDNA fragment R_(i) in Eq. (2) as

$\begin{matrix}\begin{matrix}{P\left(R_{i} \middle| \Theta,\Omega\right) = \sum\limits_{t=1}^{T+1} P\left(R_{i}, z_{i}=t \middle| \Theta,\Omega^{t}\right)} \\ {= \sum\limits_{t=1}^{T+1} P\left(z_{i}=t \middle| \Theta\right) P\left(R_{i} \middle| \Omega^{t}\right)} \\ {= \sum\limits_{t=1}^{T+1} \theta_{t} P\left(R_{i} \middle| \Omega^{t}\right)}\end{matrix} & (3)\end{matrix}$

where P(R_(i)|Ω^(t)) is the tissue-of-origin likelihood of the cfDNA read R_(i) for tissue t. Let q_(i)(t) be the posterior probability of z_(i)=t. Because the logarithm is concave, Jensen's inequality shows that the log-likelihood function in Eq. (1) is lower-bounded by the new function F(q, Θ, Ω^(new)) defined below.

$\begin{matrix}\begin{matrix}{\log P\left(R \middle| \Theta,\Omega\right) = \sum\limits_{i=1}^{N} \log P\left(R_{i} \middle| \Theta,\Omega\right)} \\ {= \sum\limits_{i=1}^{N} \log \sum\limits_{t=1}^{T+1} \theta_{t} P\left(R_{i} \middle| \Omega^{t}\right)} \\ {= \sum\limits_{i=1}^{N} \log \sum\limits_{t=1}^{T+1} q_{i}(t) \frac{\theta_{t} P\left(R_{i} \middle| \Omega^{t}\right)}{q_{i}(t)}} \\ {\geq \sum\limits_{i=1}^{N} \sum\limits_{t=1}^{T+1} q_{i}(t) \log \frac{\theta_{t} P\left(R_{i} \middle| \Omega^{t}\right)}{q_{i}(t)}} \\ {= F\left(q,\Theta,\Omega^{new}\right)}\end{matrix} & (4)\end{matrix}$

where we write Ω^(T+1) as Ω^(new) to emphasize that it is a parameter, and q is the set of all q_(i)(t) over all i=1, . . . , N and all t=1, . . . , T, T+1. According to the EM algorithm, we optimize F(q, Θ, Ω^(new)) via coordinate ascent by alternating the following two steps.

$\begin{matrix}{\text{E-step: } q^{(k+1)} = \underset{q}{\arg\max}\, F\left(q,\Theta^{(k)},\Omega^{new(k)}\right)} & (5)\end{matrix}$ $\begin{matrix}{\text{M-step: } \Theta^{(k+1)},\Omega^{new(k+1)} = \underset{\Theta,\Omega^{new}}{\arg\max}\, F\left(q^{(k+1)},\Theta,\Omega^{new}\right)} & (6)\end{matrix}$

Ω^(new) can be the methylation rate at the bin level or the CpG site level, denoted as m_(j) ^(new) for bin j (or CpG site j). In the E-step, each q_(i)(t) of q can be estimated by the posterior probability of z_(i) given R_(i), Ω^(t), and the current parameter setting θ^((k)), that is

$\begin{matrix}{q_{i}(t) = P\left(z_{i}=t \middle| \Theta, R_{i}, \Omega^{t}\right) = \frac{\theta_{t} P\left(R_{i} \middle| \Omega^{t}\right)}{\sum_{t'=1}^{T+1} \theta_{t'} P\left(R_{i} \middle| \Omega^{t'}\right)}} & (7)\end{matrix}$

In the M-step, we have

$\begin{matrix}{\theta_{t} = \frac{\sum_{i=1}^{N} q_{i}(t)}{\sum_{t'=1}^{T+1} \sum_{i=1}^{N} q_{i}(t')}} & (8)\end{matrix}$

When the methylation signature is modeled for each CpG site j (Model 2):

$\begin{matrix}{m_{j}^{new} = \frac{\sum_{i:\ \text{read } i \text{ covers CpG site } j} q_{i}(\text{new})\, R_{ij}}{\sum_{i:\ \text{read } i \text{ covers CpG site } j} q_{i}(\text{new})}} & (9)\end{matrix}$

When the methylation signature is modeled for each bin j (Model 1):

$\begin{matrix}{m_{j}^{new} = \frac{\sum_{i:\ \text{read } i \in \text{bin } j} q_{i}(\text{new}) \left| \text{methylated CpG sites of read } i \right|}{\sum_{i:\ \text{read } i \in \text{bin } j} q_{i}(\text{new}) \left| \text{all CpG sites of read } i \right|}} & (10)\end{matrix}$

Eq. (8) applies for each t=1, . . . , T+1. Therefore, Eq. (7) is the solution of the E-step, and Eqs. (8) and (9) (or (10)) are the solutions of the M-step. According to the EM algorithm, starting with a random θ⁽⁰⁾ and Ω^(new(0)) and then iteratively applying Eqs. (7), (8), and (9) or (10) converges to a local maximum of the log-likelihood function. Therefore, we need to run the EM algorithm many times with different initial values and choose the solution with the maximum log-likelihood. An illustration of this procedure is shown in FIG. 6.
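The E-step and M-step updates for the composition vector can be sketched as below. This is a simplified illustration under stated assumptions: all class signatures are treated as known and fixed, so the unknown class and its update in Eq. (9) or (10) are left out; the per-read likelihoods P(R_i|Ω^t) are supplied as a precomputed table `lik` in which every read has a positive likelihood under at least one class; and the function name and the uniform starting point are choices made for the example rather than details from the text.

```python
def em_composition(lik, n_iter=200, tol=1e-10):
    """Estimate class proportions theta from per-read likelihoods.

    lik[i][t] is P(R_i | Omega^t) for read i and class t (signatures fixed).
    """
    n_reads = len(lik)
    n_classes = len(lik[0])
    theta = [1.0 / n_classes] * n_classes  # uniform start (an assumption)
    for _ in range(n_iter):
        # E-step: posterior q_i(t) proportional to theta_t * P(R_i | Omega^t).
        q = []
        for row in lik:
            weighted = [theta[t] * row[t] for t in range(n_classes)]
            norm = sum(weighted)
            q.append([w / norm for w in weighted])
        # M-step: theta_t is the average posterior mass of class t.
        new_theta = [
            sum(q[i][t] for i in range(n_reads)) / n_reads
            for t in range(n_classes)
        ]
        change = max(abs(a - b) for a, b in zip(theta, new_theta))
        theta = new_theta
        if change < tol:
            break
    return theta
```

With fixed signatures the log-likelihood is concave in θ, so a single start is enough in this reduced setting; the multiple random restarts described above matter once Ω^(new) is estimated as well.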

In the E-step of the above EM algorithm, Eq. (7), a key calculation is P(R_(i)|Ω^(t)), which is the tissue-of-origin likelihood of a cfDNA read R_(i) for tissue t. According to the different models of methylation signatures Ω^(t) (as listed in FIG. 4), we have the following methods of calculating P(R_(i)|Ω^(t)).

Epiallele model without considering inter-individual variances of Ω^(t): We can simply use the relative frequency in the frequency histogram of methylation status as P(R_(i)|Ω^(t)). This is illustrated in Model 1 of FIG. 11.

CpG site model without considering inter-individual variances of Ω^(t): Let m^(t)=(m₁ ^(t), m₂ ^(t), . . . , m_(M) ^(t)) denote the average methylation rates m_(j) ^(t) of the tissue-specific CpG sites j (1≤j≤M) for tissue t. If we model the methylation status (denoted as r_(ij)=0 or 1) of cfDNA read R_(i) at the j-th CpG site as following the Bernoulli distribution r_(ij)~Bernoulli(m_(j) ^(t)), as shown in Model 2 of FIG. 11, P(R_(i)|Ω^(t)) can be calculated as

$\begin{matrix}{P\left(R_{i} \middle| \Omega^{t}\right) = \prod\limits_{\substack{1 \leq j \leq M \\ j \text{ covered by } R_{i}}} \left(m_{j}^{t}\right)^{r_{ij}} \left(1-m_{j}^{t}\right)^{1-r_{ij}}} & (11)\end{matrix}$

where r_(ij) is the j-th element of the row vector R_(i).
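The Bernoulli product over covered sites can be computed as in the sketch below. The function name is hypothetical, and the read is assumed to be given as a list with 0, 1, or `None` (site not covered) per CpG site, aligned with the list of per-site methylation rates.

```python
def read_likelihood_bernoulli(read, rates):
    """P(R_i | Omega^t) under independent per-site Bernoulli rates.

    read:  list of 0, 1 or None (None = CpG site not covered by the read).
    rates: list of methylation rates m_j^t, one per CpG site.
    """
    likelihood = 1.0
    for state, rate in zip(read, rates):
        if state is None:
            continue  # uncovered sites do not contribute to the product
        likelihood *= rate if state == 1 else (1.0 - rate)
    return likelihood
```

For a read that is methylated at a site with rate 0.8, unmethylated at a site with rate 0.25, and does not cover a third site, the likelihood is 0.8 × 0.75 = 0.6.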

CpG site model with inter-individual variances of Ω^(t): If we consider the inter-individual variances of the tissue-specific methylation rate of the j-th CpG site, we can model the Bernoulli random variable r_(ij) with the Beta prior m_(j) ^(t)~Beta(α_(j) ^(t), β_(j) ^(t)), where the inter-individual variance of each m_(j) ^(t) is characterized by the Beta distribution with the parameters α_(j) ^(t) and β_(j) ^(t). Then P(R_(i)|Ω^(t)) can be calculated as (shown in Model 2 of FIG. 4)

$\begin{matrix}{P\left(R_{i} \middle| \Omega^{t}\right) = \prod\limits_{\substack{1 \leq j \leq M \\ j \text{ covered by } R_{i}}} \frac{B\left(r_{ij}+\alpha_{j}^{t},\ 1-r_{ij}+\beta_{j}^{t}\right)}{B\left(\alpha_{j}^{t},\beta_{j}^{t}\right)}} & (11)\end{matrix}$

Bin model without considering inter-individual variances of Ω^(t): Let the methylation rate m^(t) represent Ω^(t) in a bin. P(R_(i)|Ω^(t)) is a simplified version of Eq. (11) (the Bernoulli likelihood of the CpG site model without inter-individual variances), with a single rate m^(t) shared by all CpG sites in the bin, and can be calculated as (shown in Model 3 of FIG. 11)

$\begin{matrix}{P\left(R_{i} \middle| \Omega^{t}\right) = \prod\limits_{\substack{1 \leq j \leq M \\ \text{CpG site } j \text{ covered by } R_{i}}} \left(m^{t}\right)^{r_{ij}} \left(1-m^{t}\right)^{1-r_{ij}}} & (12)\end{matrix}$

Bin model with inter-individual variances of Ω^(t): If we consider the inter-individual variances of the tissue-specific methylation rate m^(t) of each bin, we can model the Bernoulli random variable r_(ij) with the Beta prior m^(t)~Beta(α^(t), β^(t)), where the inter-individual variance of each m^(t) is characterized by the Beta distribution with the parameters α^(t) and β^(t). Then P(R_(i)|Ω^(t)) can be calculated as (shown in Model 1 of FIG. 4)

$\begin{matrix}{P\left(R_{i} \middle| \Omega^{t}\right) = \prod\limits_{\substack{1 \leq j \leq M \\ \text{CpG site } j \text{ covered by } R_{i}}} \frac{B\left(r_{ij}+\alpha^{t},\ 1-r_{ij}+\beta^{t}\right)}{B\left(\alpha^{t},\beta^{t}\right)}} & (13)\end{matrix}$
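As a worked simplification (not stated in the text, but following from the identity B(x+1, y)=B(x, y)·x/(x+y)), each per-site factor in the Beta-Bernoulli likelihoods above reduces to a simple ratio of the Beta parameters:

$\frac{B\left(r_{ij}+\alpha,\ 1-r_{ij}+\beta\right)}{B\left(\alpha,\beta\right)} = \begin{cases} \dfrac{\alpha}{\alpha+\beta}, & r_{ij}=1 \\ \dfrac{\beta}{\alpha+\beta}, & r_{ij}=0 \end{cases}$

Here α and β stand for α_(j) ^(t) and β_(j) ^(t) in the CpG site model, or α^(t) and β^(t) in the bin model. For example, with α=2 and β=6, a methylated site contributes 2/8=0.25 and an unmethylated site contributes 6/8=0.75 to the product.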

Example 3 Sample Collection and Analysis

Collecting DNA methylation data from public tumor tissue samples for characterizing methylation signatures: We collected all the publicly available methylation data of both solid tumors and plasma cfDNA samples from normal people and cancer patients. Since the majority of tumor methylation profiles in the TCGA (The Cancer Genome Atlas) database were assayed with the Infinium HumanMethylation450 microarray, we collected only methylation microarray data for solid tumors, covering 2 tissue types: liver (LIHC) and lung (cancer types LUAD and LUSC). As of December 2013, we had collected 169 LIHC tumors and 450/359 LUAD/LUSC tumors. Note that we used only normal tissue samples from the RoadMap project. In the future, we will use the microarray methylation data of solid normal tissue samples from GEO and TCGA.

Here, LIHC stands for Liver Hepatocellular Carcinoma; LUAD stands for Lung Adenocarcinoma; and LUSC stands for Lung Squamous Cell Carcinoma.

Building CpG clusters (features): Most public methylation data (TCGA and GEO) are Infinium HumanMethylation450 microarray data. In this study, we used as features the clusters of CpG sites (covered by probes), i.e., genomic regions of variable sizes. This choice is due to the low sequencing coverage of the patients' plasma cfDNA data (~4× on average), which come from a single source: Chan et al. (2013). CpG clusters cover more sequencing reads and thus generate more reliable methylation measurements than individual CpG sites. In detail, each CpG cluster is a genomic bin formed by grouping all adjacent CpG probes (sites) within the bin, and it must satisfy two criteria: (i) the genomic distance between two adjacent CpG probes in a cluster is less than 700 bp, and (ii) each cluster has at least 10 CpG probes. The two numbers were chosen so that as many CpG probes (of microarray data) as possible are clustered, for sufficient coverage by WGBS reads, while keeping each cluster's genomic size as small as possible. It is observed that most clusters are associated with only one gene, according to the annotation of the probes. This yields 11,572 CpG clusters, which include about ⅓ of all the CpG probes on the Infinium HumanMethylation450 microarray. These CpG clusters are used for selecting signature features in Component 1 of the method shown in FIG. 12.
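The two clustering criteria can be illustrated with the sketch below. The function name and the input format (a sorted list of probe coordinates on a single chromosome) are assumptions for the example; the defaults mirror the 700 bp and 10-probe thresholds stated above.

```python
def build_cpg_clusters(positions, max_gap=700, min_probes=10):
    """Group sorted probe coordinates on one chromosome into CpG clusters.

    Adjacent probes stay in the same cluster while their distance is
    less than max_gap; clusters with fewer than min_probes are dropped.
    """
    clusters, current = [], []
    for pos in positions:
        if current and pos - current[-1] >= max_gap:
            if len(current) >= min_probes:
                clusters.append(current)
            current = []
        current.append(pos)
    if len(current) >= min_probes:
        clusters.append(current)
    return clusters
```

Lowering `max_gap` gives smaller, tighter clusters at the cost of clustering fewer probes, which is the trade-off the two thresholds are meant to balance.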

Collecting DNA methylation data of public plasma cfDNA samples: The only major public source of methylation data for plasma cfDNA samples that we could find is a recent study of plasma cfDNA methylation analysis³. In this study, Chan et al. (2013) deposited the whole-genome bisulfite sequencing (WGBS) data of a set of plasma samples, including 32 normal people, 8 patients infected with chronic hepatitis B virus (HBV), 26 liver cancer patients, and 4 lung cancer patients, for a total of 70 plasma samples. Note that these WGBS data have low sequencing coverage (~4× on average). Bismark is employed to align the reads to the reference genome hg19 and call the methylated cytosines⁴. After the removal of PCR duplicates, the numbers of methylated and unmethylated cytosines are counted for each CpG site. A CpG cluster's methylation level is then calculated as the ratio of the number of methylated cytosines to the total number of cytosines within the cluster.

Cancer-specific and tissue-specific methylation signatures: We used both the plasma cfDNA WGBS data of 16 normal people (from the Chan et al. (2013) data) and solid tumors (from TCGA data) to obtain 1,452 cancer-specific methylation signatures, which are modeled by variable-size bins (the CpG clusters defined above and Model 1 of FIG. 4). For tissue-specific methylation signatures, we used 5,820 biomarkers learned from RoadMap data by Sun et al. (2015), which cover 14 tissues and are modeled by 500 bp bins. These biomarkers were learned from only one individual in the RoadMap data, not from a population.

Please note that in these preliminary results, we did not include the new unknown tissue type in the cfDNA composition inference algorithm in Step 3. In future work, we will include it.

Example 4 Prediction Performance

Results of Simulated Methylation Data of cfDNA Cancer Patients

We simulated the WGBS data of a patient's plasma cfDNA sample by mixing the aligned reads from a healthy plasma sample's WGBS data and a solid tumor sample's WGBS data, with a predefined sequencing coverage of ~2× and at 8 different tumor DNA burdens: 0.5%, 0.8%, 1%, 3%, 5%, 10%, 15% and 20%. We used 4 normal plasma samples (from Chan et al. (2013)), 1 solid lung tumor sample (from TCGA), and 1 solid liver tumor sample (from Chan et al. (2013)). In total, there are 4×(1+1)×8=64 simulated cancer patients' plasma samples.

By including the remaining 12 normal plasma samples (i.e., of the 32 samples, 16 were used for learning signatures and 4 for generating simulation data), we apply our method, which gives a prediction score to each simulated sample and to each of the 12 normal plasma samples. We then evaluate the healthy-or-cancer prediction performance by the Area Under the Curve (AUC) of the Receiver Operating Characteristic (ROC). In FIG. 8, we plot the AUCs for simulation data with different tumor DNA fractions. For simplicity of presentation, the cancer compositions are called the “cancer dimension”, while the normal tissue compositions are called the “tissue dimension”. It can be observed that using both dimensions (gray line) outperforms using either the cancer dimension (blue line) or the tissue dimension (orange line) alone at all tumor burdens. In addition, our method with two dimensions achieves an AUC of around 75% at a tumor burden of 5% of plasma cfDNA.

Results of Real cfDNA Cancer Patients

As mentioned above, we used 16 normal plasma samples (out of the 32 normal samples in the Chan et al. (2013) data) to derive cancer-specific methylation signatures. Therefore, in this experiment, we randomly selected 8 of the remaining 16 normal plasma samples to build the empirical distribution of tissue and tumor compositions for the normal plasma population, which is used in the Z-score based integration method. We then used the last 8 normal samples in the test set. The test set also includes 8 plasma samples from patients infected with chronic hepatitis B virus (we regard them as noncancerous or normal samples), 26 liver cancer patients, and 4 lung cancer patients. Because the normal plasma samples are partitioned randomly, the test set differs for every random partition. We therefore evaluate the method on different test sets and use the average performance score (AUC or confusion matrix) as the final result.

We evaluate two prediction tasks: (i) binary classification, in which a new patient is classified as healthy or as having cancer; and (ii) multi-class classification, in which a new patient is classified as normal, liver cancer, or lung cancer. The results are shown in FIG. 9. We draw the following conclusions:

Tissue dimension works better than tumor dimension in the multi-class classification task, but worse in the binary classification task. This is not surprising, because (i) carcinogenesis mechanisms are common to many cancer types, so cancer methylation signatures do not differ much between cancer types; in contrast, (ii) as evidenced in the literature, many methylation patterns are tissue-specific and can contribute to tissue-of-origin prediction.

Integrating both dimensions further improves performance on the multi-class classification task. This observation highlights the importance of integrating as many tumor-relevant clues as possible in plasma cfDNAs.

Example 5 Monitoring Cancer Patients

In addition to cancer diagnosis, our method is also effective in monitoring the cancer progression of patients by their prediction score. The prediction score intuitively describes how far the tumor and tissue composition of the patient's plasma cfDNAs is from that of normal people. Therefore, the larger the prediction score, the later the cancer stage (or the more serious the condition) of the patient.

Two liver cancer patients with longitudinal plasma cfDNA WGBS data (from the Chan et al. (2013) data) were used. Before tumor resection, Patient 1 (sample ID: TBR36) had a prediction score of 34.2, much higher than the average score (i.e., zero) of normal people. At 3 days and at 3, 6, and 12 months following surgery, the prediction score was back in the normal range (around zero). As shown in FIG. 10, this is also observable for the composition of 14 tissues in this patient's plasma cfDNA sample. However, Patient 2 (sample ID: TBR34) still had high prediction scores at 3 days and 2 months after the operation, 10.8 and 12.9 respectively. This patient passed away later.

Example 6 Material and Methods for Examples 7-9

Overview

The goal of this approach is to classify each read (in the methylation marker regions) into either the tumor-derived cfDNA class (abbreviated as class T) or the normal-plasma-derived cfDNA class (abbreviated as class N). Here, one type of cancer, liver cancer, is the focus, but the method can be generalized to any cancer type. This approach comprises three major steps: (i) Identifying the DNA methylation signatures of liver cancer. The methylation markers of liver cancer were derived based on DNA methylation data of liver tumors and their matched normal tissues as well as normal plasma cfDNA samples. The vast amount of methylation data was collected from the public database TCGA (The Cancer Genome Atlas; Weinstein et al. (2013)) and the recent literature (Chan et al. (2013)). (ii) Calculating the likelihood for a read to harbor a methylation signature. Methylation sequencing was performed on the plasma cfDNA sample of a new patient. Sequencing reads were obtained for those cfDNA fragments that fall into the genomic regions of the selected markers. To account for data uncertainty and inter-individual methylation variances in markers, the likelihood of each read coming from each class was calculated. (iii) Inferring cfDNA composition. The likelihood of each read coming from each class can be used to derive the tumor burden in cfDNAs. FIG. 21 gives an overall picture of our approach, and we detail individual steps in the sections below.

Identify and Characterize Methylation Markers Specific to Liver Cancer

A methylation marker includes two kinds of information: its genomic region and its methylation patterns in both solid tumor samples (class T) and normal plasma cfDNA samples (class N). To take advantage of the large amount of public methylation data from TCGA that were mainly generated by the microarray platform, the following two-step procedure was developed to obtain the liver-cancer-specific methylation markers.

Identify Genomic Markers for Liver Cancer

Only genomic regions that are covered by sufficient microarray probes qualified as potential markers. Therefore, the definition of CpG clusters in our recent work (Kang et al. (2017), which is hereby incorporated by reference) is used to identify all potential genomic regions. See the Examples below for details. Among all potential regions, those regions were selected whose methylation levels can differentiate most liver tumor samples not only from their matched normal liver tissues but also from normal plasma samples. This task inherently includes two steps: (i) Selecting “frequently differential methylation regions (FDMR)”, in which the methylation levels differ (by more than a cutoff) between matched tumor and normal tissues in more than half of the matched pairs. This step can remove markers specific to liver tissues, but retain markers specific to liver cancer. (ii) Selecting those FDMRs that can distinguish tumor samples from normal plasma samples, i.e., those for which the difference between the medians of the methylation levels in the two classes is greater than a cutoff. This step ensures that the tumor methylation signal can be identified in blood. Given a fixed sequencing coverage of cfDNAs, the more markers used (that is, the larger the panel size), the lower the quality these markers may have, but the more tumor-derived cfDNA reads may be identified. Therefore, there is a trade-off between the markers' quality and the amount of tumor cfDNA signal that can be used. In this work, because all public plasma cfDNA samples have low sequencing coverage (1×-3×), the cutoff of the methylation difference in both steps was chosen as 0.2 in order to keep relatively good marker quality while maintaining a large enough methylation marker panel to capture sufficient tumor cfDNAs at this low sequencing coverage.
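The two-step selection can be sketched as follows. This is an illustration only: the function names, the dictionary layout of `regions` (keys `tumor`, `matched_normal`, `plasma`), and the reading of “differential” as an absolute difference are assumptions, not details given in the text. The default cutoff of 0.2 is the value stated above.

```python
from statistics import median


def is_fdmr(tumor_levels, matched_normal_levels, cutoff=0.2):
    """Step (i): differential in more than half of the matched pairs."""
    pairs = list(zip(tumor_levels, matched_normal_levels))
    n_diff = sum(1 for t, n in pairs if abs(t - n) > cutoff)
    return n_diff > len(pairs) / 2


def separates_from_plasma(tumor_levels, plasma_levels, cutoff=0.2):
    """Step (ii): class medians differ by more than the cutoff."""
    return abs(median(tumor_levels) - median(plasma_levels)) > cutoff


def select_markers(regions, cutoff=0.2):
    """Return the names of regions that pass both steps."""
    selected = []
    for name, data in regions.items():
        if is_fdmr(data["tumor"], data["matched_normal"], cutoff) and \
                separates_from_plasma(data["tumor"], data["plasma"], cutoff):
            selected.append(name)
    return selected
```

Raising `cutoff` keeps fewer, cleaner markers; lowering it enlarges the panel, which is the quality-versus-signal trade-off discussed above.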

Characterize Methylation Patterns

In each marker region identified in Step 1, the inter-individual variance of methylation levels in each class (T and N) is considered. Given a region, the methylation levels of all samples in a class were modelled as following a Beta distribution Beta(η, ρ), which has been widely used in methylation data analyses. Specifically, a marker k is associated with two methylation patterns, i.e., Beta(η_(k) ^(T), ρ_(k) ^(T)) for class T and Beta(η_(k) ^(N), ρ_(k) ^(N)) for class N. Note that η_(k) ^(c) and ρ_(k) ^(c) are the two shape parameters (usually denoted α and β) of a Beta distribution, but here the symbols η and ρ are used to avoid confusion with the α-value and β-values defined in the Introduction section. The parameters of a Beta distribution can be easily learnt from the sample population of a class, using either the method of moments or maximum likelihood (Bowman et al. (2007)). To simplify notation, the methylation pattern of marker k is denoted as m_(k) ^(T)≡Beta(η_(k) ^(T), ρ_(k) ^(T)) for class T and as m_(k) ^(N)≡Beta(η_(k) ^(N), ρ_(k) ^(N)) for class N.
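A method-of-moments fit of the two Beta shape parameters can be sketched as below. The function name is hypothetical, and the use of the unbiased sample variance is an assumption; the text does not say which variance estimator was used.

```python
from statistics import mean, variance


def fit_beta_moments(levels):
    """Estimate Beta shape parameters (eta, rho) by the method of moments.

    levels: methylation levels in (0, 1) of one marker across the samples
    of one class. Uses the sample mean and the unbiased sample variance.
    """
    mu = mean(levels)
    var = variance(levels)
    if not 0.0 < mu < 1.0 or var <= 0.0 or var >= mu * (1.0 - mu):
        raise ValueError("moments are not compatible with a Beta distribution")
    common = mu * (1.0 - mu) / var - 1.0
    return mu * common, (1.0 - mu) * common
```

For the levels 0.2, 0.4, 0.6, 0.8 the mean is 0.5 and the sample variance is 0.2/3, which gives η=ρ=1.375.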

Calculate the Class-Specific Likelihood of Each cfDNA Sequencing Read

The goal was to classify each cfDNA read as class T or N, based on the joint-methylation-status of multiple CpG sites on the read. The joint-methylation-status in a cfDNA read is denoted as r=(r₁, r₂, . . . ), where the binary value r_(j)=1 or 0 represents the methylated or unmethylated status of CpG site j in read r. This binary vector r was modelled by the Beta-Bernoulli distribution (Shah et al. (2015)). Specifically, given a methylation pattern m≡Beta(η, ρ) of the marker into which read r falls, the methylation status r_(j) of each CpG site j in the read is distributed as r_(j)~Bernoulli(p), where p is the average methylation rate of CpG sites within the marker and follows the Beta prior distribution p~Beta(η, ρ). Using this statistical model, the likelihood of the joint methylation status in read r=(r₁, r₂, . . . ), given the methylation pattern m, can be calculated as below:

$\begin{matrix}{P\left(r \middle| m\right) = \prod_{j} P\left(r_{j} \middle| \mathrm{Beta}\left(\eta,\rho\right)\right)} \\ {= \prod_{j} \int_{0}^{1} \mathrm{Bernoulli}\left(r_{j} \middle| p\right) \mathrm{Beta}\left(p \middle| \eta,\rho\right) dp} \\ {= \prod_{j} \int_{0}^{1} p^{r_{j}} \left(1-p\right)^{1-r_{j}} \frac{p^{\eta-1}\left(1-p\right)^{\rho-1}}{B\left(\eta,\rho\right)} dp} \\ {= \prod_{j} \frac{B\left(r_{j}+\eta,\ 1-r_{j}+\rho\right)}{B\left(\eta,\rho\right)}}\end{matrix}$

where B(x, y) is the beta function. Therefore, for marker k with methylation pattern m_(k) ^(T) of class T and m_(k) ^(N) of class N, we can use the above formula to compute the class-specific likelihoods of read r, i.e., P(r|m_(k) ^(T)) and P(r|m_(k) ^(N)). Note that this likelihood calculation implements a probabilistic version of the α-value for individual reads. An example is illustrated in FIG. 22.
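The closed-form likelihood can be evaluated numerically as in the sketch below, which works in log space through `math.lgamma` to avoid underflow on long reads. The function names are hypothetical, and the read is assumed to be a list of 0/1 calls for the CpG sites it covers.

```python
import math


def log_beta(x, y):
    """Natural log of the beta function B(x, y)."""
    return math.lgamma(x) + math.lgamma(y) - math.lgamma(x + y)


def read_log_likelihood(read, eta, rho):
    """log P(r | m) for m = Beta(eta, rho); read is a list of 0/1 calls."""
    base = log_beta(eta, rho)
    return sum(log_beta(r_j + eta, 1 - r_j + rho) - base for r_j in read)


def class_log_likelihoods(read, tumor_pattern, normal_pattern):
    """Return (log P(r | m^T), log P(r | m^N)) for (eta, rho) pairs."""
    return (
        read_log_likelihood(read, *tumor_pattern),
        read_log_likelihood(read, *normal_pattern),
    )
```

For instance, a read with calls (1, 1, 0) under Beta(2, 6) has likelihood 0.25 × 0.25 × 0.75 = 0.046875.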

Predict Tumor-Derived cfDNA Burden

As illustrated in FIG. 21, a probabilistic framework was developed to infer the tumor-derived cfDNA fraction (i.e., tumor burden), denoted as 0≤θ<1, by classifying cfDNA reads into two classes (class T for tumor-derived DNAs and class N for normal plasma cfDNAs), based on a set of markers associated with the methylation patterns of the two classes. The methylation patterns of all K markers are denoted as ℳ={(m₁ ^(T), m₁ ^(N)), . . . , (m_(k) ^(T), m_(k) ^(N)), . . . , (m_(K) ^(T), m_(K) ^(N))}. We also denote the methylation sequencing data of a patient's cfDNAs as a set of N reads R={r⁽¹⁾, . . . , r^((N))} that in total cover M CpG sites. For a read that is aligned to the region of marker k, we assume that it can come from one of two classes with the class-specific likelihood P(r|m_(k) ^(c)), where m_(k) ^(c) is the methylation pattern of class c. Let θ be the tumor-derived cfDNA burden, so the fraction of normal cfDNA is 1−θ. It is desirable to estimate θ by maximizing the log-likelihood log P(R|θ, ℳ). This is a maximum likelihood estimation problem. The independence of each read was assumed (as widely adopted in the literature (Yuan et al. (2015), Landau et al. (2014))), so P(R|θ, ℳ)=Π_(i=1) ^(N)P(r^((i))|θ, ℳ). The likelihood P(r^((i))|θ, ℳ) of read r^((i)) is expanded as follows:

P(r^((i))|θ, ℳ)=θP(r^((i))|m_(k) ^(T))+(1−θ)P(r^((i))|m_(k) ^(N))

Because P(R|θ, ℳ) has only one parameter θ to be estimated, a grid search can be applied to exhaustively enumerate about 1000 fraction values that are uniformly distributed between 0% and 100%, i.e., 0%, 0.1%, . . . , 99.9% and 100%. This method finds the global optimum at a precision of 0.1%, which we think is sufficient for capturing the tiny amount of tumor-derived cfDNAs. Because the grid search is computationally fast, the step size can be readily refined to determine θ at higher resolutions.
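A short supporting derivation, not part of the original text, shows why a one-dimensional grid is adequate here. Writing a_i=P(r^((i))|m_(k) ^(T)) and b_i=P(r^((i))|m_(k) ^(N)) as shorthand introduced only for this note, the log-likelihood and its first two derivatives in θ are:

$\ell(\theta) = \sum_{i=1}^{N} \log\left(\theta a_{i} + (1-\theta) b_{i}\right), \qquad \ell'(\theta) = \sum_{i=1}^{N} \frac{a_{i}-b_{i}}{\theta a_{i} + (1-\theta) b_{i}}, \qquad \ell''(\theta) = -\sum_{i=1}^{N} \frac{\left(a_{i}-b_{i}\right)^{2}}{\left(\theta a_{i} + (1-\theta) b_{i}\right)^{2}} \leq 0$

Since the second derivative is never positive, the log-likelihood is concave in θ, so it has no spurious local maxima and the best grid point lies within one grid step of the true maximizer.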

Removal of “germline” markers: Above, a global tumor burden (θ) across all cancer-specific markers was estimated. The tumor burden can also be estimated for a single marker alone. Ideally, for an early-stage cancer patient, the estimated θ should be a small number (e.g., <20%), either across all markers or in individual markers. However, in real cancer patient data, we observed a number of markers with individually estimated tumor burdens far larger than the global tumor burden. Therefore, cfDNA fragments harboring aberrant methylation in these “outlier” markers evidently do not come from cancerous cells, but likely come from normal cells (e.g., white blood cells) due to inter-individual variance (e.g., age, environmental exposure, or other diseases the person may have). Such methylation abnormalities are conceptually similar to “germline” mutations. Consequently, including these “germline” markers would impair the accuracy of tumor burden estimation. An iterative algorithm was designed to adjust the global tumor burden after identifying and removing “germline” markers. The tumor burden at marker k is denoted as θ_(k), to distinguish it from the global burden θ obtained using all markers. The procedure of this algorithm is presented below:

Initialization: Let S denote the set of markers used for θ estimation. Initially, all markers are put into S.

Step 1 (remove “germline” markers): Estimate θ_(k) of each marker k in S and calculate the standard deviation of all θ_(k), denoted as std(θ_(k)). Remove from S those markers whose θ_(k)>θ+λ·std(θ_(k)), where λ is a fixed input parameter.

Step 2 (update θ): Estimate the global burden θ using all markers of S as updated in Step 1.

Step 3: Iterate Steps 1 and 2 until θ converges.

The output θ is the adjusted global tumor burden after removing “germline” markers. The parameter λ of this algorithm controls how far the θ_(k) of those “germline” markers deviates from the average θ. This parameter can be estimated using normal plasma cfDNA samples, because it is expected that the optimal λ should be able to adjust the global θ of the normal samples to be close to zero.
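The iterative procedure can be sketched as below. Several details are assumptions made for the example rather than facts from the text: the input is a dictionary from marker name to a list of (P(r|m_k^T), P(r|m_k^N)) pairs, θ is estimated by a grid search, the population standard deviation is used for std(θ_k), and convergence is declared when no marker is removed and θ stops changing.

```python
import math
from statistics import pstdev


def estimate_theta(read_liks, steps=1000):
    """Grid-search estimate of theta from (P(r|T), P(r|N)) pairs."""
    best_theta, best_ll = 0.0, -math.inf
    for s in range(steps + 1):
        theta = s / steps
        ll = 0.0
        for p_t, p_n in read_liks:
            mix = theta * p_t + (1.0 - theta) * p_n
            if mix <= 0.0:
                ll = -math.inf
                break
            ll += math.log(mix)
        if ll > best_ll:
            best_theta, best_ll = theta, ll
    return best_theta


def pooled(reads_by_marker, markers):
    """Collect the reads of all markers in the current set."""
    return [pair for k in markers for pair in reads_by_marker[k]]


def adjust_tumor_burden(reads_by_marker, lam, max_iter=50, tol=1e-9):
    """Iteratively drop outlier ("germline") markers and re-estimate theta."""
    markers = set(reads_by_marker)
    theta = estimate_theta(pooled(reads_by_marker, markers))
    for _ in range(max_iter):
        # Step 1: per-marker burdens and removal of high outliers.
        theta_k = {k: estimate_theta(reads_by_marker[k]) for k in markers}
        spread = pstdev(list(theta_k.values()))
        kept = {k for k in markers if theta_k[k] <= theta + lam * spread}
        if not kept:
            break
        # Step 2: update the global burden on the remaining markers.
        new_theta = estimate_theta(pooled(reads_by_marker, kept))
        done = kept == markers and abs(new_theta - theta) < tol
        markers, theta = kept, new_theta
        if done:
            break
    return theta, markers
```

In a toy input with three markers whose reads all favor class N and one marker whose reads all favor class T, the outlier marker is removed in the first pass and the adjusted global burden drops to zero.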

Methylation Data Collection, Generation and Processing

Data collection: We collected the methylation profiles of 49 solid liver tumor samples and their matched adjacent solid liver tissue samples from the TCGA database. All of these samples were assayed using the Infinium HumanMethylation450 microarray. For the plasma cfDNA samples, the methylation sequencing data from Chan et al. (2013) and Sun et al. (2015) were used. They include the Whole Genome Bisulfite Sequencing (WGBS) data of plasma samples taken from 32 healthy people, 8 patients infected with chronic hepatitis B virus (HBV), and 29 liver cancer patients.

Data generation: Since the public WGBS data of plasma cfDNA samples have very low sequencing coverage (1×~3×), WGBS data of plasma cfDNA samples were generated at higher coverage (~10× on average) from two healthy people, and WGBS data of solid tumor samples were generated from two cancer patients, in order to simulate higher-coverage cfDNA WGBS data from cancer patients. Blood samples were centrifuged at 1600×g for 10 minutes, and then the plasma was transferred into new microtubes and centrifuged at 16,000×g for another 10 minutes. The plasma was collected and stored at −80° C. cfDNA was extracted from 5 ml of plasma using the Qiagen QIAamp Circulating Nucleic Acids Kit and quantified using a Qubit 3.0 Fluorometer (Thermo Fisher Scientific). Bisulfite conversion of cfDNA was performed using the EZ-DNA-Methylation-GOLD kit (Zymo Research). After that, an Accel-NGS Methy-Seq DNA library kit (Swift Bioscience) was used to prepare the sequencing libraries. The DNA libraries were then sequenced with 150-bp paired-end reads. For the solid tumor samples, bisulfite conversion was performed with the EZ-DNA-Methylation-GOLD kit (Zymo Research), and the sequencing libraries were prepared using the TruSeq DNA Methylation Kit. The DNA libraries were then sequenced with 150-bp paired-end reads using HiSeq X (Illumina).

Processing methylation microarray data: The microarray data (level 3 in the TCGA database) provide the methylation levels of individual CpG sites. The methylation level of a CpG cluster is defined as the average methylation level of all CpG sites in the cluster. A cluster's methylation level is marked as “not available” (NA) if more than half of its CpG sites do not have methylation measurements.
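The cluster-level rule for microarray data can be illustrated with a minimal sketch. The function name and the use of None to represent NA are assumptions made for illustration only; they are not part of the disclosed software.

```python
import math
from typing import List, Optional


def cluster_level_from_array(site_levels: List[Optional[float]]) -> Optional[float]:
    """Average methylation level of one CpG cluster from per-site array values.

    Each entry is the methylation level of one CpG site, or None/NaN when the
    array has no measurement for that site. Returns None (NA) when more than
    half of the sites in the cluster lack a measurement.
    """
    measured = [v for v in site_levels if v is not None and not math.isnan(v)]
    missing = len(site_levels) - len(measured)
    if not site_levels or missing > len(site_levels) / 2:
        return None
    return sum(measured) / len(measured)
```

For example, a cluster with values 0.2, 0.4 and one missing site is averaged over the two measured sites, whereas a cluster with one measured site and two missing sites is reported as NA.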

Processing WGBS data: Bismark (Krueger et al. (2011)) was used to align the reads to the reference genome hg19 and call the methylated cytosines. After the removal of PCR duplicates, the numbers of methylated and unmethylated cytosines were counted for each CpG site. The methylation level of a CpG cluster is calculated as the ratio between the number of methylated cytosines and the total number of cytosines within the cluster. However, if the total number of cytosines in the reads aligned to the CpG cluster is less than 30, the methylation level of this cluster is treated as NA (Not Available). This WGBS data processing procedure is used for calculating the average methylation level of a CpG cluster in the normal plasma samples that are used for identifying methylation markers. When a plasma cfDNA sample is used as test data, the joint methylation status of all CpG sites in each individual sequencing read aligned to the regions of the marker panel was extracted from Bismark's output and fed into CancerDetector as its input data. Because the sequencing coverage of the real data is low, in this work we used all reads covering at least one CpG site. For cfDNA methylation data with high coverage, reads covering fewer than 3 CpG sites can be filtered out to improve the input data quality.
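The two WGBS rules above (the cluster-level ratio with its 30-cytosine minimum, and the per-read CpG filter) can be sketched as follows. The input layouts are hypothetical: per-site (methylated, unmethylated) count pairs, and a per-read string of CpG calls. Neither is Bismark's actual output format.

```python
from typing import List, Optional, Tuple

MIN_CYTOSINES = 30  # minimum total cytosine count stated in the text


def cluster_level_from_wgbs(site_counts: List[Tuple[int, int]]) -> Optional[float]:
    """Methylation level of a CpG cluster from WGBS counts.

    site_counts holds one (methylated, unmethylated) pair per CpG site.
    Returns None (NA) when fewer than MIN_CYTOSINES cytosines were observed.
    """
    methylated = sum(m for m, _ in site_counts)
    total = sum(m + u for m, u in site_counts)
    if total < MIN_CYTOSINES:
        return None
    return methylated / total


def keep_read(cpg_calls: str, min_cpgs: int = 1) -> bool:
    """Decide whether a read is used as input.

    cpg_calls is an assumed encoding of the read's joint methylation status,
    one character per covered CpG ('1' methylated, '0' unmethylated).
    min_cpgs=1 mirrors the low-coverage setting; 3 mirrors the stricter
    filter suggested for high-coverage data.
    """
    return len(cpg_calls) >= min_cpgs
```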

Example 7 Identify Methylation Markers Specific to Liver Cancer

Defining all genomic regions eligible to serve as methylation markers: Our training data are from the TCGA solid tissues, measured by the Infinium HumanMethylation450 microarray with ~450,000 CpGs. However, our testing data (Chan et al. (2013), Sun et al. (2015)) are WGBS data with very low sequencing coverage. Therefore, the CpG sites were grouped into CpG clusters in order to use more mappable reads from the testing data. For a CpG site covered by a probe on the microarray, the region 100 bp up- and down-stream was defined as its flanking region, and all CpG sites located within this region were assumed to have the same average methylation level as the CpG site covered by the probe. Two adjacent CpG sites are grouped into a CpG cluster if their flanking regions overlap. Finally, only those CpG clusters containing at least three CpGs covered by microarray probes are used, in order to achieve robust measurement of methylation levels. This procedure yielded 42,374 CpG clusters, which together include about one-half of all the CpG sites on the Infinium HumanMethylation450 microarray. Most of these clusters are each associated with only one gene. These CpG clusters are used for subsequent feature selection.
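The clustering step can be sketched as a single pass over sorted probe positions on one chromosome. This is an illustrative sketch only: the function name is hypothetical, and treating flanking regions that merely touch (sites exactly 200 bp apart) as overlapping is an assumption, since the text does not specify the boundary case.

```python
from typing import List

FLANK = 100      # bp up- and down-stream of each probe CpG (from the text)
MIN_PROBES = 3   # minimum probe CpGs per retained cluster (from the text)


def build_clusters(positions: List[int]) -> List[List[int]]:
    """Group probe CpG positions on one chromosome into CpG clusters."""
    clusters: List[List[int]] = []
    for pos in sorted(positions):
        # Flanks [p - FLANK, p + FLANK] of two adjacent sites overlap when
        # the sites are at most 2 * FLANK bp apart (boundary case assumed).
        if clusters and pos - clusters[-1][-1] <= 2 * FLANK:
            clusters[-1].append(pos)
        else:
            clusters.append([pos])
    # Keep only clusters with enough probe CpGs for a robust measurement.
    return [c for c in clusters if len(c) >= MIN_PROBES]
```

With probe positions 1000, 1150, 1300, 2000, 2100 and 5000, only the first three sites form a retained cluster; the pair at 2000 and 2100 and the isolated site at 5000 are discarded.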

Selecting liver-cancer-specific markers and characterizing their patterns in normal and tumor classes: Given the 42,374 CpG clusters, the cancer-specific markers were selected by using the method described in Example 6 on the training data: (i) 49 pairs of solid liver tumors and their matched normal liver tissues and (ii) 75% of the 32 healthy plasma samples. The remaining 25% of the healthy plasma samples were used as test data. The healthy plasma samples were randomly partitioned 10 times into training and test data at a ratio of 75/25, giving 10 different sets of training/test data, each of which can yield different selected markers and tumor burden estimates. Each set of training/test data and its result is called an experimental run. Across the 10 runs, an average of 3,214 liver-cancer-specific markers (CpG clusters) were identified per run, and the majority of these markers were shared by all runs. The methylation patterns of each marker in the normal and tumor classes were modeled as two Beta distributions, with learnt shape parameters that can capture the inter-individual variance of methylation levels within a class.
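The repeated 75/25 partition and the per-class Beta fit can be sketched as below. The text does not state how the shape parameters were learnt; the method-of-moments estimator used here is an assumption chosen for brevity, and both function names are hypothetical.

```python
import random
from statistics import mean, variance
from typing import List, Sequence, Tuple


def split_runs(sample_ids: Sequence[str], n_runs: int = 10,
               train_frac: float = 0.75, seed: int = 0
               ) -> List[Tuple[List[str], List[str]]]:
    """Randomly partition samples into (training, test) sets n_runs times."""
    rng = random.Random(seed)
    runs = []
    for _ in range(n_runs):
        shuffled = list(sample_ids)
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * train_frac)
        runs.append((shuffled[:cut], shuffled[cut:]))
    return runs


def fit_beta_moments(levels: Sequence[float]) -> Tuple[float, float]:
    """Estimate Beta shape parameters (a, b) for one marker in one class.

    Method of moments (an assumed estimator): match the sample mean and
    variance of the methylation levels observed across individuals.
    """
    m = mean(levels)
    v = variance(levels)
    if v <= 0 or v >= m * (1 - m):
        raise ValueError("moment estimates are undefined for these data")
    common = m * (1 - m) / v - 1
    return m * common, (1 - m) * common
```

With 32 healthy plasma samples, each run yields 24 training and 8 test samples. A marker with a small fitted variance within a class gives large shape parameters, that is, a narrow Beta distribution.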

Example 8 Simulation Experiments Demonstrated the Ultrasensitivity of CancerDetector in Detecting Tumor cfDNAs

The methylation data of a plasma cfDNA sample were simulated by sampling and mixing the methylation sequencing reads of two real samples, a normal plasma cfDNA sample and a solid tumor sample, at a variety of tumor burdens (θ) and sequencing coverages (c). This strategy allows us to mimic real data while precisely controlling the tumor burden and sequencing coverage in the mixture samples, in order to test the power and requirements of the CancerDetector method, e.g., at what tumor burden and sequencing coverage tumor-derived cfDNAs can be detected. CancerDetector was compared with another probabilistic cfDNA deconvolution method, “CancerLocator” (Kang et al. (2017) and U.S. Provisional Patent Application 62/473,829, which is hereby incorporated by reference), which the inventors recently developed and which is so far the only other method aimed at deconvoluting cancer signals from cfDNA methylation data. While CancerDetector is a read-based method, CancerLocator is based on traditional β-values: it deconvolves the β-values of markers in the cfDNAs as a linear combination of the β-values of two classes (tumor or normal cfDNAs).
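The read-mixing strategy can be illustrated with a minimal sketch. The helper names, the representation of reads as list items, and the conversion from coverage to a read count over a fixed region are all assumptions for illustration; the actual simulation code is not given in the text.

```python
import random
from typing import List, Sequence, TypeVar

Read = TypeVar("Read")


def reads_for_coverage(coverage: float, region_bp: int, read_len: int) -> int:
    """Number of reads giving the requested mean coverage over region_bp."""
    return int(coverage * region_bp // read_len)


def simulate_mixture(normal_reads: Sequence[Read], tumor_reads: Sequence[Read],
                     theta: float, n_reads: int, seed: int = 0) -> List[Read]:
    """Mix reads so that a fraction theta of the sample is tumor-derived.

    Reads are drawn without replacement from each source, so each source
    must contain at least as many reads as are requested from it.
    """
    rng = random.Random(seed)
    n_tumor = round(theta * n_reads)
    n_normal = n_reads - n_tumor
    mixed = rng.sample(list(tumor_reads), n_tumor) + \
        rng.sample(list(normal_reads), n_normal)
    rng.shuffle(mixed)
    return mixed
```

Because the tumor fraction and the total read count are set explicitly, the true tumor burden of every simulated sample is known and can be compared directly with the predicted value.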

To compare the sensitivity of the two methods in identifying a minor trace of tumor cfDNAs (i.e., θ≤5%), plasma cfDNA samples at 8 different tumor burdens (θ=0, 0.1%, 0.3%, 0.5%, 0.8%, 1%, 3% and 5%) and 3 different sequencing coverages (c=2, 5, 10) were simulated. The real samples used in the simulation procedure are the WGBS data of two normal plasma samples (N1L and N2L) and of two solid liver tumor samples (HCC1 and HCC2). This experimental setting results in 8×3×2×2=96 mixed samples. FIG. 23 demonstrates the sensitivity of the two methods in detecting tumor cfDNAs, where scatter plots are shown for the predicted tumor burdens, averaged over 10 experimental runs, of the simulated samples with 8 given tumor burdens at three given sequencing coverages (2×, 5× and 10×). As shown in FIG. 23, the blood tumor burdens predicted by CancerDetector 410 are highly consistent with the true values and have very low prediction variances: e.g., at the highest sequencing coverage (10×), CancerDetector 410 achieved a Pearson's Correlation Coefficient (PCC) of 0.9974±0.0012 (P-value=7.2E-8±9.8E-8), averaged over 10 runs. The consistency increased with the sequencing coverage, i.e., average PCC=0.9811, 0.9959, 0.9974 and their associated P-values 2.5E-5, 6.0E-6, 9.8E-8 for the sequencing coverages of 2×, 5×, 10×, respectively. More importantly, it can be observed that CancerDetector 410 can (i) detect tumor cfDNAs with a low tumor burden (θ=1%) at low sequencing coverage (2×), and (ii) improve the detection limit from 1% to 0.3% when the sequencing coverage is increased (5× and 10×). On the other hand, the β-value-based method, CancerLocator 420, cannot detect any tumor DNAs when the tumor burden θ is <5% at 2× sequencing coverage, or when θ<3% at 5× coverage. Even with 10× sequencing coverage, its predictions are not stable (there is high prediction variance) and deviate strongly from the true tumor burdens. In summary, this result demonstrates that the read-based CancerDetector 410 method can sensitively detect a small amount of tumor cfDNAs, even at low sequencing coverage, and that the prediction accuracy and stability increase with higher sequencing coverage.
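The experimental design and the consistency measure described above can be reproduced in outline as follows. The sketch enumerates the 8×3×2×2 combinations and defines a plain Pearson correlation; the variable names are illustrative, and no predicted values from the experiments are included.

```python
from itertools import product
from math import sqrt
from typing import Sequence

THETAS = [0.0, 0.001, 0.003, 0.005, 0.008, 0.01, 0.03, 0.05]  # tumor burdens
COVERAGES = [2, 5, 10]                                        # sequencing coverages
NORMALS = ["N1L", "N2L"]                                      # normal plasma samples
TUMORS = ["HCC1", "HCC2"]                                     # solid liver tumor samples

# One entry per simulated mixture: (theta, coverage, normal sample, tumor sample).
design = list(product(THETAS, COVERAGES, NORMALS, TUMORS))


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson's correlation coefficient between true and predicted values."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

The length of `design` is 96, matching the count stated in the text. For one coverage level, `pearson` would be applied to the 8 true tumor burdens and the corresponding predicted burdens.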

Example 9 Testing on Real Data Confirmed the High Sensitivity of CancerDetector in Deconvoluting Tumor cfDNA

A collection of public plasma samples (32 healthy people, 8 HBV carriers and 29 liver cancer patients) was obtained from Chan et al. (2013) and Sun et al. (2015). These data have low sequencing coverages (1× to 3×). The 32 healthy plasma samples were randomly split into a training set (75%) and a test set (25%) 10 times (runs). In each run, the liver-cancer-specific methylation markers were identified using the healthy-plasma training set combined with the TCGA microarray data of solid liver tumors and matched normal tissues, and tumor burdens were then predicted for the test set: the plasma samples from the 8 HBV carriers, the 29 liver cancer patients, and the remaining 25% of healthy subjects. The performance of predicting tumor burdens can be measured by the Receiver Operating Characteristic (ROC) curve, where the sensitivity and specificity of separating cancer and non-cancer samples are calculated and plotted by using different tumor burden cutoffs. As shown in FIGS. 24A and 24B, the average ROC curve of CancerDetector 510 outperforms that of CancerLocator 520 in terms of both prediction performance and robustness (i.e., much lower standard deviations). For example, when we chose the point of the top-left corner in the ROC curve for determining the tumor burden threshold, at the specificity of 100% CancerDetector 510 can achieve an average sensitivity of 95.2% across 10 runs with standard deviation 3.2%, while the β-value-based CancerLocator 520 method achieved on average a sensitivity of 74.1% with standard deviation 10.7% at the specificity of 100%. Note that there are at least 25 early-stage (Barcelona Clinic Liver Cancer stage A) patients among the 29 liver cancer patients. Testing only on the 25 early-stage cancer patients and healthy/HBV samples, at the specificity of 100% our method can also achieve an average sensitivity of 94.4% with a standard deviation of 3.7%, while CancerLocator 520 obtained a sensitivity of 74.4% with a standard deviation of 10.0%. Summarizing the performance comparison using the Area Under Curve (AUC), our method can achieve an AUC of 0.988 averaged over 10 runs with standard deviation 0.004 for all real samples and an average AUC of 0.987 with standard deviation 0.005 for early-stage cancer patients, while CancerLocator 520 obtained a lower average AUC of 0.975 with standard deviation 0.014 for all real samples and an average AUC of 0.975 with standard deviation 0.0143 for early-stage cancer patients. CancerDetector 510 correctly predicted the cfDNA tumor burdens of all eight chronic hepatitis B virus (HBV) samples to be in the same range as those of the normal samples (i.e., close to zero), well distinguished from the cancer samples. These results demonstrate that CancerDetector 510 can go beyond distinguishing healthy samples from cancer samples and handle more sophisticated scenarios, such as differentiating HBV carriers from cancer patients. Therefore, using real plasma samples, we also demonstrated that the read-based CancerDetector method can more sensitively detect tumor cfDNAs.
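The two summary measures used above, sensitivity at 100% specificity and AUC, can be computed from predicted tumor burdens as in the following sketch. The function names are illustrative, and the AUC is computed with the rank-based (pairwise comparison) formulation, which is equivalent to the area under the empirical ROC curve; the values in the usage note are made up and are not results from the experiments.

```python
from typing import Sequence


def sensitivity_at_full_specificity(cancer: Sequence[float],
                                    control: Sequence[float]) -> float:
    """Fraction of cancer samples whose predicted tumor burden exceeds that
    of every non-cancer sample, i.e. sensitivity at 100% specificity."""
    cutoff = max(control)
    return sum(score > cutoff for score in cancer) / len(cancer)


def auc(cancer: Sequence[float], control: Sequence[float]) -> float:
    """Area under the ROC curve: probability that a random cancer sample
    scores above a random non-cancer sample (ties count one half)."""
    wins = sum((c > n) + 0.5 * (c == n) for c in cancer for n in control)
    return wins / (len(cancer) * len(control))
```

For illustrative predicted burdens of 0.02, 0.10, 0.30 and 0.001 in cancer samples and 0.0, 0.001 and 0.005 in non-cancer samples, the cutoff is 0.005, three of the four cancer samples exceed it, and the pairwise AUC is 10.5/12.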

In general, the predicted tumor burden correlates well with tumor size. As shown in FIG. 24C, among the 26 liver cancer patients with tumor size information, the PCC between the predicted tumor burden and tumor size is 0.87 (P-value=7.37e-09). Even after removing the three patients with the largest tumors (size>6 cm), we still get a relatively good PCC of 0.42 (P-value=4.61e-02).

CancerDetector can also be used for monitoring cancer progression and treatment. We used two liver cancer patients from Chan et al. (2013) whose plasma samples were obtained before surgical tumor resection and at multiple time points after the surgery. The first patient survived beyond 12 months, while the second patient died of metastatic disease after the operation (Chan et al. (2013)). As shown in FIG. 25, the predicted blood tumor burdens are consistent with the treatment effects: the first patient's tumor burdens quickly fall into the normal range, while those of the second patient remain relatively high after the surgery.

The success of early cancer detection largely relies on (i) high-quality cancer-specific methylation markers, and (ii) a computational method for the ultra-sensitive detection of tiny amounts of tumor cfDNAs (usually <5%, or even <0.5% in early-stage cancer patients). In some embodiments, there is a method to deconvolute the tumor cfDNA out of total cfDNA at the resolution of individual reads. Compared with traditional cancer detection methods, our method has two advantages in identifying subtle tumor signals: (i) It exploits the pervasive nature of DNA methylation to significantly amplify aberrant cfDNA signals. As demonstrated in FIG. 1 and FIG. 20 and in our experimental results, the joint methylation status (α-value) of multiple CpG sites in a read carries more sensitive tumor signals than the average methylation rate (β-value) of an individual CpG site. Our probabilistic method based on the α-value is particularly advantageous when tumor burdens and sequencing coverages are low. (ii) It jointly deconvolutes the tumor burden across all markers. Existing methods often focus on detecting tumor signals in single tumor markers, without aggregating signals from all markers (Sharma (2009), Sturgeon et al. (2009)). In contrast, our method assumes that subtle tumor signals should occur at multiple places in the genome. Although our method can detect tumor cfDNA at the read level, all possible signals were combined to provide a robust and sensitive estimate of the tumor burden. Together, features (i) and (ii) enable, as demonstrated, our method to perform well at extremely low tumor burdens (<1%) at low or medium sequencing coverage (5× and 10×). Therefore, the approach holds the potential to largely reduce the cost of cancer detection.
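The idea of estimating a single tumor burden from the joint methylation status of individual reads, pooled across all markers, can be illustrated with a deliberately simplified sketch. It is not the disclosed CancerDetector model: each class is represented by one fixed methylation rate per marker instead of a Beta distribution, CpG sites in a read are treated as independent given the class, and θ is found by a plain grid search. All names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

# One read: (CpG calls, tumor-class rate, normal-class rate) for its marker.
# Calls are 1 (methylated) or 0 (unmethylated); rates must lie strictly in (0, 1).
ReadRecord = Tuple[Sequence[int], float, float]


def read_likelihood(calls: Sequence[int], rate: float) -> float:
    """Probability of the read's joint methylation status under one class."""
    k = sum(calls)
    return rate ** k * (1.0 - rate) ** (len(calls) - k)


def estimate_theta(reads: List[ReadRecord], grid_size: int = 1000) -> float:
    """Grid-search maximum-likelihood estimate of the tumor fraction theta.

    Each read is modeled as coming from the tumor class with probability
    theta and from the normal class with probability 1 - theta; the
    log-likelihood is summed over reads from all markers.
    """
    best_theta, best_ll = 0.0, -math.inf
    for i in range(grid_size + 1):
        theta = i / grid_size
        ll = 0.0
        for calls, tumor_rate, normal_rate in reads:
            p = (theta * read_likelihood(calls, tumor_rate)
                 + (1.0 - theta) * read_likelihood(calls, normal_rate))
            ll += math.log(p)
        if ll > best_ll:
            best_theta, best_ll = theta, ll
    return best_theta
```

In this toy setting, a read with four methylated CpGs at a marker with tumor rate 0.9 and normal rate 0.1 is several thousand times more likely under the tumor class than under the normal class, whereas a single CpG would give only a nine-fold ratio. This illustrates why the joint status of a read can be more informative than a per-site average.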

Furthermore, the skilled artisan will recognize the applicability of various features from different embodiments. Similarly, the various elements, features and steps discussed above, as well as other known equivalents for each such element, feature or step, can be mixed and matched by one of ordinary skill in this art to perform methods in accordance with principles described herein. Among the various elements, features, and steps some will be specifically included and others specifically excluded in diverse embodiments.

Although the invention has been disclosed in the context of certain embodiments and examples, it will be understood by those skilled in the art that the embodiments of the invention extend beyond the specifically disclosed embodiments to other alternative embodiments and/or uses and modifications and equivalents thereof.

Many variations and alternative elements have been disclosed in embodiments of the present invention. Still further variations and alternate elements will be apparent to one of skill in the art.

In some embodiments, the numbers expressing quantities of ingredients, properties such as molecular weight, reaction conditions, and so forth, used to describe and claim certain embodiments of the invention are to be understood as being modified in some instances by the term “about.” Accordingly, in some embodiments, the numerical parameters set forth in the written description and attached claims are approximations that can vary depending upon the desired properties sought to be obtained by a particular embodiment. In some embodiments, the numerical parameters should be construed in light of the number of reported significant digits and by applying ordinary rounding techniques. Notwithstanding that the numerical ranges and parameters setting forth the broad scope of some embodiments of the invention are approximations, the numerical values set forth in the specific examples are reported as precisely as practicable. The numerical values presented in some embodiments of the invention may contain certain errors necessarily resulting from the standard deviation found in their respective testing measurements.

In some embodiments, the terms “a” and “an” and “the” and similar references used in the context of describing a particular embodiment of the invention (especially in the context of certain of the following claims) can be construed to cover both the singular and the plural. The recitation of ranges of values herein is merely intended to serve as a shorthand method of referring individually to each separate value falling within the range. Unless otherwise indicated herein, each individual value is incorporated into the specification as if it were individually recited herein. All methods described herein can be performed in any suitable order unless otherwise indicated herein or otherwise clearly contradicted by context. The use of any and all examples, or exemplary language (e.g. “such as”) provided with respect to certain embodiments herein is intended merely to better illuminate the invention and does not pose a limitation on the scope of the invention otherwise claimed. No language in the specification should be construed as indicating any non-claimed element essential to the practice of the invention.

Groupings of alternative elements or embodiments of the invention disclosed herein are not to be construed as limitations. Each group member can be referred to and claimed individually or in any combination with other members of the group or other elements found herein. One or more members of a group can be included in, or deleted from, a group for reasons of convenience and/or patentability. When any such inclusion or deletion occurs, the specification is herein deemed to contain the group as modified thus fulfilling the written description of all Markush groups used in the appended claims.

Furthermore, numerous references have been made to patents and printed publications throughout this specification. Each of the above cited references and printed publications are herein individually incorporated by reference in their entirety.

In closing, it is to be understood that the embodiments of the invention disclosed herein are illustrative of the principles of the present invention. Other modifications that can be employed can be within the scope of the invention. Thus, by way of example, but not of limitation, alternative configurations of the present invention can be utilized in accordance with the teachings herein. Accordingly, embodiments of the present invention are not limited to that precisely as shown and described.

REFERENCES

-   Li, S. et al. Dynamic evolution of clonal epialleles revealed by methclone. Genome Biol. 15, 472 (2014).
-   Yuan, K. et al. BitPhylogeny: a probabilistic framework for reconstructing intra-tumor phylogenies. Genome Biol. 16, 36 (2015).
-   Zheng, X. et al. MethylPurify: tumor purity deconvolution and differential methylation detection from single tumor DNA methylomes. Genome Biol. 15, 419 (2014).
-   Sun, K. et al. Plasma DNA tissue mapping by genome-wide methylation sequencing for noninvasive prenatal, cancer, and transplantation assessments. Proc. Natl. Acad. Sci. U.S.A 112, E5503-12 (2015).
-   Houseman, E. A. et al. DNA methylation arrays as surrogate measures of cell mixture distribution. BMC Bioinformatics 13, 86 (2012).
-   Koestler, D. C. et al. Blood-based profiles of DNA methylation predict the underlying distribution of cell types: a validation analysis. Epigenetics 8, 816-26 (2013).
-   Koestler, D. C. et al. Peripheral blood immune cell methylation profiles are associated with nonhematopoietic cancers. Cancer Epidemiol. Biomarkers Prev. 21, 1293-302 (2012).
-   Chan, K. C. A. et al. Noninvasive detection of cancer-associated genome-wide hypomethylation and copy number aberrations by plasma DNA bisulfite sequencing. Proc. Natl. Acad. Sci. U.S.A 110, 18761-8 (2013).
-   Bettegowda, C. et al. (2014) Detection of circulating tumor DNA in early- and late-stage human malignancies. Sci. Transl. Med., 6, 224ra24.
-   Newman, A. M. et al. (2014) An ultrasensitive method for quantitating circulating tumor DNA with broad patient coverage. Nat. Med., 20, 548-54.
-   Newman, A. M. et al. (2016) Integrated digital error suppression for improved detection of circulating tumor DNA. Nat. Biotechnol., 34, 547-55.
-   Burrell, R. A. et al. (2013) The causes and consequences of genetic heterogeneity in cancer evolution. Nature, 501, 338-345.
-   Turner, N. C. et al. (2012) Genetic heterogeneity and cancer drug resistance. Lancet Oncol., 13, e178-e185.
-   Greenman, C. et al. (2007) Patterns of somatic mutation in human cancer genomes. Nature, 446, 153-158.
-   Schmitt, M. W. et al. (2012) Implications of genetic heterogeneity in cancer. Ann. N. Y. Acad. Sci., 1267, 110-116.
-   Baylin, S. B. et al. (2001) Aberrant patterns of DNA methylation, chromatin formation and gene expression in cancer. Hum. Mol. Genet., 10, 687-92.
-   Cheishvili, D. et al. (2015) DNA demethylation and invasive cancer: implications for therapeutics. Br. J. Pharmacol., 172, 2705-15.
-   Roy, D. M. et al. (2014) Driver mutations of cancer epigenomes. Protein Cell, 5, 265-96.
-   Plass, C. et al. (2013) Mutations in regulators of the epigenome and their connections to global chromatin patterns in cancer. Nat. Rev. Genet., 14, 765-80.
-   Smith, L. T. et al. (2007) Unraveling the epigenetic code of cancer for therapy. Trends Genet., 23, 449-56.
-   Sigalotti, L. et al. (2007) Epigenetic drugs as pleiotropic agents in cancer treatment: biomolecular aspects and clinical applications. J. Cell. Physiol., 212, 330-44.
-   Kang, S. et al. (2017) CancerLocator: non-invasive cancer diagnosis and tissue-of-origin prediction using methylation profiles of cell-free DNA. Genome Biol., 18, 53.
-   Cancer Genome Atlas Research Network, Weinstein, J. N. et al. (2013) The Cancer Genome Atlas Pan-Cancer analysis project. Nat. Genet., 45, 1113-20.
-   Bowman, K. O. et al. (2007) The beta distribution, moment method, Karl Pearson and RA Fisher. Far East J. Theor. Stat., 23, 133-164.
-   Shah, A. et al. (2015) An empirical study of stochastic variational algorithms for the Beta Bernoulli process. In International Conference on Machine Learning (ICML).
-   Landau, D. A. et al. (2014) Locally Disordered Methylation Forms the Basis of Intratumor Methylome Variation in Chronic Lymphocytic Leukemia. Cancer Cell, 26, 813-825.
-   Krueger, F. et al. (2011) Bismark: a flexible aligner and methylation caller for Bisulfite-Seq applications. Bioinformatics, 27, 1571-2.
-   Sharma, S. (2009) Tumor markers in clinical practice: General principles and guidelines. Indian J. Med. Paediatr. Oncol., 30, 1.
-   Sturgeon, C. M. et al. (2009) Use of tumor markers in clinical practice: quality requirements. American Association for Clinical Chemistry.
-   Landan, G. et al. (2012) Epigenetic polymorphism and the stochastic formation of differentially methylated regions in normal and cancerous tissues. Nat. Genet., 44, 1207-14.
-   Li, S. et al. (2016) Distinct evolution and dynamics of epigenetic and genetic heterogeneity in acute myeloid leukemia. Nat. Med., 22, 792-9.
-   Swanton, C. et al. (2014) Epigenetic noise fuels cancer evolution. Cancer Cell, 26, 775-6.
-   Guo, S. et al. (2017) Identification of methylation haplotype blocks aids in deconvolution of heterogeneous tissue samples and tumor tissue-of-origin mapping from plasma DNA. Nat. Genet., 49, 635-642.

1. A method for treating a subject, comprising: (a) receiving a plurality of sequencing reads for a cell-free deoxyribonucleic acid (cfDNA) sample obtained or derived from the subject, wherein each of the plurality of sequencing reads comprises methylation sequencing data obtained from a nucleic acid sequence; (b) determining a methylation pattern for a sequencing read in the plurality of sequencing reads, wherein the methylation pattern comprises a genomic region corresponding to the nucleic acid sequence and methylation status of one or more motifs in the genomic region; (c) comparing the methylation pattern with each of one or more pre-established methylation signatures to compute one or more likelihood scores, wherein each of the one or more pre-established methylation signatures correlates with a cancer, and wherein each pre-established methylation signature comprises at least one pre-determined signature region and pre-determined methylation rate associated therewith; (d) characterizing the cfDNA sample as containing cfDNAs derived from the cancer tissue, based at least in part on at least one of the one or more likelihood scores, thereby identifying the subject as having the cancer; and (e) administering a treatment to the subject based on the identifying the subject as having the cancer, wherein the treatment comprises a member selected from the group consisting of a chemotherapy, a radiation therapy, an immunotherapy, and a tumor resection.
2. The method of claim 1, further comprising: performing (b), (c), and (d) for each sequencing read in the plurality of sequencing reads.
3. The method of claim 1, further comprising: establishing the one or more pre-established methylation signatures based on existing methylation sequencing data.
4. The method of claim 2, further comprising: determining a level of the cfDNAs derived from the cancer tissue based at least in part on a number of sequencing reads derived from the cancer tissue in the plurality of sequencing reads.
5. The method of claim 3, wherein the existing methylation sequencing data is selected from the group consisting of tissue-specific sequencing data, disease-specific sequencing data, individual sequencing data, population sequencing data, and combinations thereof.
6. The method of claim 1, wherein the cfDNA sample is obtained or derived from a plasma sample, a blood sample, a saliva sample, an amniotic fluid sample, a cystic fluid sample, a spinal fluid sample, a brain fluid sample, a urine sample, a sweat sample, or a tears sample from the subject.
7. The method of claim 1, wherein the cancer tissue comprises a member selected from the group consisting of liver cancer tissue, lung cancer tissue, kidney cancer tissue, colon cancer tissue, small intestines cancer tissue, pancreas cancer tissue, adrenal glands cancer tissue, esophagus cancer tissue, adipose cancer tissue, heart cancer tissue, brain cancer tissue, placenta cancer tissue, and combinations thereof.
8. The method of claim 7, wherein the cancer tissue comprises a member selected from the group consisting of liver cancer tissue, lung cancer tissue, kidney cancer tissue, colon cancer tissue, pancreas cancer tissue, brain cancer tissue, and combinations thereof.
9. The method of claim 1, wherein the methylation status and pre-determined methylation status are determined at bin level.
10. The method of claim 1, wherein the methylation status and pre-determined methylation status are determined at CpG site level.
11. The method of claim 1, wherein the one or more motifs comprises a CpG site.
12. The method of claim 4, further comprising: comparing the level of the cfDNAs derived from the cancer tissue to a first reference level derived from a reference subject with cancer.
13. The method of claim 4, further comprising: comparing the level of the cfDNAs derived from the cancer tissue to a second reference level derived from a reference subject without cancer.
14. The method of claim 13, further comprising: determining the second reference level at least in part by: receiving a second plurality of sequencing reads for a second cfDNA sample obtained or derived from the reference subject without cancer, wherein each of the second plurality of sequencing reads comprises second methylation sequencing data obtained from a second nucleic acid sequence; determining a second methylation pattern for a sequencing read in the second plurality of sequencing reads, wherein the second methylation pattern comprises a second genomic region corresponding to the second nucleic acid sequence and methylation status of one or more motifs in the second genomic region; comparing the second methylation pattern with each of the one or more pre-established methylation signatures to compute one or more second likelihood scores; characterizing the second cfDNA sample as containing cfDNAs derived from the cancer tissue, based at least in part on at least one of the one or more second likelihood scores; repeating the determining, comparing and characterizing for each sequencing read in the second plurality of sequencing reads; and determining a level of the cfDNA derived from the cancer tissue in the reference subject without cancer, based at least in part on a number of sequencing reads derived from the cancer tissue in the second plurality of sequencing reads.
15.-155. (canceled)
156. The method of claim 1, wherein each of the plurality of sequencing reads comprises methylation sequencing data obtained from a consecutive nucleic acid sequence of 50 or more nucleic acids.
157. The method of claim 1, wherein (d) further comprises characterizing the cfDNA sample as containing cfDNAs derived from the cancer tissue, based at least in part on whether the at least one of the one or more likelihood scores exceeds a threshold value.
158. The method of claim 6, wherein the cfDNA sample is obtained or derived from the plasma sample.
159. The method of claim 6, wherein the cfDNA sample is obtained or derived from the blood sample.
160. The method of claim 8, wherein the cancer tissue comprises liver cancer tissue.
161. The method of claim 8, wherein the cancer tissue comprises lung cancer tissue.
162. The method of claim 8, wherein the cancer tissue comprises kidney cancer tissue.
163. The method of claim 8, wherein the cancer tissue comprises colon cancer tissue.
164. The method of claim 8, wherein the cancer tissue comprises breast cancer tissue.